ABSTRACT

The Naval Postgraduate School's Student Opinion Form
data were subjected to study through the use of two cluster
analysis techniques: (1) the K-MEANS partitioning method and
(2) Chernoff's FACES. Much developmental work was performed
to tailor these methods to the special requirements of the
data set. A thorough multivariate statistical review provided
the basis for choosing optimality criteria and distance
functions for use in the MIKCA (Multivariate Iterative K-MEANS
Clustering Algorithm). Alterations were made to the computer
code to allow the analysis to include the effect of class
size on cluster membership. Use of the linear discriminant
function aided in identifying variables for use in constructing
features of the computer-drawn faces. This approach to the
Chernoff's FACES technique shows promise but needs further
development. A principal components analysis of the data
showed it to be essentially one dimensional. Partitioning
the data into four clusters shows that the scoring of the
courses varies inversely with class size.
I.   INTRODUCTION

The Student Opinion Form (SOF) used at the Naval Postgraduate
School provides an organized information gathering
mechanism about each course (and its instructor) as perceived
by the students. The information obtained from the
SOF data is used for administrative review of faculty
performance and for feedback to the instructor to aid in
self-development. The former use is hampered by the fact
that the data are multivariate in nature and represent a
complicated set of interactions between the instructor's
performance, the nature of the course, and the group of
students. There is need for methodology which can disentangle
those interactions and summarize the data in a
meaningful way.

It is the purpose of this thesis to develop suitable
cluster analysis methods for studying the data and discovering
any hidden structure they may possess. Concurrently,
a certain amount of exploratory data analysis took place,
and those results are also reported.

At  the  completion  of  every  quarter,  students  are  requested 
to  respond  to  a  16-item  SOF  questionnaire  for  each  course 
in  which  they  are  enrolled.   The  data  are  viewed  as  an  n  by  p 
matrix, representing n observations (SOFs), each of which is
measured on p (16) different variables. For this research
the mean vector of each course was computed. Then attempts
were made to discover natural clusters of these mean vectors
which  in  turn  can  be  interpreted  as  the  underlying  structure 
in  the  data.   Since  the  number  of  students  per  course  is 
quite  variable,  the  mean  vectors  are  not  equally  well 
determined.   Also,  the  matrix  of  mean  vectors  may  have  a 
covariance  structure  quite  different  from  that  of  the  full 
n  by  p  data  matrix. 

The  clustering  objective  was  pursued  by  two  multivariate 
statistical  methods:   one  computer-graphic  technique  referred 
to  as  Chernoff's  FACES,  and  a  second,  more  mathematically 
oriented  approach  called  K-MEANS.   The  former  produces  computer- 
drawn  cartoon  faces,  the  features  of  which  are  controlled 
by  variables  in  the  data.   The  assignment  of  variables  to 
features  was  aided  by  the  use  of  linear  discriminant  analysis. 
One  face  is  produced  from  each  course  mean  vector,  and  then 
the  researcher  is  able  to  study  the  appearance  of  the  faces 
and cluster together those that display similar characteristics.
The second method utilizes a computer program
called MIKCA (Multivariate Iterative K-MEANS Clustering
Algorithm), which is based on the K-MEANS method. It forms
an initial partition of the data and then transfers observations
between clusters in order to improve an optimality
criterion function. In this iterative manner, MIKCA ultimately
stabilizes and provides an "optimal" cluster solution.

In addition, a modified MIKCA technique was employed.
Alterations were made to the basic computer code to enable
the program to weight each mean vector by the number of
students in the course. This modification may be likened
to a one-way Analysis of Variance (ANOVA) with unequal
numbers of observations per treatment. The result is
to stabilize the relative variability of the various course
mean vectors.
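
The idea of weighting by class size can be sketched in a few lines. This is only an illustration, not the modified MIKCA code: the function names and the two-course example are hypothetical, and the sketch assumes the weight attached to each course mean vector is simply its number of students.

```python
def weighted_centroid(means, sizes):
    """Cluster centroid with each course mean vector weighted by class size."""
    total = sum(sizes)
    p = len(means[0])
    return [sum(n * m[k] for m, n in zip(means, sizes)) / total
            for k in range(p)]


def weighted_within_ss(means, sizes, centroid):
    """Sum over courses of n_i times the squared Euclidean distance to the centroid."""
    return sum(n * sum((a - c) ** 2 for a, c in zip(m, centroid))
               for m, n in zip(means, sizes))


# Hypothetical course means on two items, with class sizes 30 and 10.
means = [[4.0, 3.0], [2.0, 5.0]]
sizes = [30, 10]
centroid = weighted_centroid(means, sizes)
spread = weighted_within_ss(means, sizes, centroid)
print(centroid, spread)
```

The centroid is pulled toward the larger class, which is the sense in which a well-determined mean vector counts for more than a poorly determined one.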

Most multivariate analysis methodology is derived assuming
the data have a multivariate normal distribution with common
covariance matrix. The performance of the MIKCA program
and the linear discriminant analysis will not depend greatly
upon this assumption provided the clusters are well defined.
On the other hand, if the clusters are not well separated,
the results of the programs will be sensitive to these
assumptions, and this is the condition anticipated. Accordingly,
a transformation was sought that would bring the data
closer to satisfying these assumptions. The one
selected is essentially a logistic function.

It is frequently necessary to compare the agreement of
cluster solutions produced under different conditions or by
different methods. For this purpose, a computer program
was written which provides an ad hoc measure of the amount
of agreement between the results of two or more solutions.
A number between zero and one, called the comparison
coefficient, is the resulting measure of association.

This thesis was largely exploratory and should serve
as a firm foundation for future study of the SOF data in
particular and similarly structured multivariate data sets
in general. A number of unexpected questions were raised
during the exploratory phases of this research. It was
not possible to answer many of these questions, and their
consideration is left to other researchers. During the
development of the methodologies, some new and challenging
problems were encountered. Many of these had to be given
rather short treatment in the interest of meeting the original
objectives. It should be emphasized that although some very
interesting facts are revealed in this thesis, the results
are by no means considered to describe completely the
information hidden in the data.
II.   CLUSTER  ANALYSIS

A.   ORIGIN  AND  THEORY 

Cluster  analysis  is  the  name  given  to  a  body  of  diverse 
techniques  for  discovering  taxonomical  structure  within  bodies 
of  data.   It  is  one  of  several  methodologies  included  in 
the  broader  category  called  classification.   In  cluster 
analysis  little  or  nothing  is  known  about  the  category 
structure. All that is available is a collection of observations
whose category memberships are unknown, and one must
discover a category structure which fits the observations.
The objective is to find the natural groups by sorting the
observations such that the association is high among members
of the same group and low between members of different groups.
The  great  challenge  to  the  researcher  is  finding  the  most 
appropriate  way  of  defining  "natural  groups"  and  "association." 
Cluster analysis is closely related to and often confused
with discriminant analysis, a statistical procedure for
assigning new observations to known groups. In contrast
to discriminant analysis, clustering refers to discovery of
the initial groups.

Although  modern  clustering  techniques  began  development 
in  biological  taxonomy,  they  are  generally  applicable  to 
all  types  of  data.   Any  method  which  partitions  a  set  of 
objects  into  subsets  on  the  basis  of  measurements  taken  on 
every  object  qualifies  as  a  clustering  method.   Cluster 
analysis techniques are most often applied in multivariate
settings, that is, where each of n observations is measured
on  p  different  variables.   A  clear  intuitive  picture  of  the 
concept  is  helpful  in  appreciating  the  value  of  cluster 
analysis  and  the  situations  to  which  it  might  be  applied. 
In  a  geometric  sense,  every  object  (observation)  may  be 
viewed  as  a  point  in  p-dimensional  Euclidean  space.   This 
swarm  of  data  points  may  contain  dense  regions  or  "clouds" 
of  data  points  which  are  separable  from  other  regions 
containing  a  low  density  of  points.   These  denser  regions 
constitute  what  are  known  as  clusters.   In  the  one  and  two 
dimensional  cases,  it  is  easy  for  the  human  eye  to  quickly 
detect  the  clusters  from  scatter  plots,  assuming  that  the 
clusters  exist.   In  higher  dimensions,  clustering  attempts 
become  extremely  difficult  without  the  aid  of  computers. 

Solutions  to  the  clustering  problem  usually  involve  the 
determination  of  a  partition  which  satisfies  some  optimality 
criterion.   The  optimality  criterion  is  a  way  of  measuring 
how  good  a  particular  cluster  solution  is  relative  to  other 
solutions. An astounding number of possible solutions exist.
Reference 1 describes a Stirling number of the second kind
representing the number of ways n objects may be sorted
into m groups:

$$ S_n^{(m)} = \frac{1}{m!} \sum_{k=0}^{m} (-1)^{m-k} \binom{m}{k} k^n $$

The number of groups is usually unknown, so the problem is
compounded, and the total number of possibilities is a sum
of Stirling numbers. In the case of 25 observations, the
total number of possible cluster solutions is

$$ \sum_{j=1}^{25} S_{25}^{(j)} $$

which exceeds $4 \times 10^{18}$. This illustrates that the enumerative
technique for finding solutions can require huge amounts of
computer time, and there exists a need for a better way.
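
The count quoted for 25 observations can be checked directly from the Stirling formula. The sketch below is a minimal illustration using exact integer arithmetic; the function names `stirling2` and `count_partitions` are chosen here and do not come from the thesis or from Reference 1.

```python
from math import comb, factorial


def stirling2(n, m):
    """Number of ways to sort n objects into m non-empty groups."""
    total = sum((-1) ** (m - k) * comb(m, k) * k ** n for k in range(m + 1))
    return total // factorial(m)  # the alternating sum is always divisible by m!


def count_partitions(n):
    """Total number of cluster solutions when the number of groups is unknown."""
    return sum(stirling2(n, j) for j in range(1, n + 1))


total_25 = count_partitions(25)
print(total_25)
```

Even at a million criterion evaluations per second, a count of this size would take on the order of a hundred thousand years to enumerate, which is why the iterative methods discussed next are needed.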

Modern techniques allow solutions to be found without
evaluating the criterion for each and every solution. However,
the need for ranking solutions is evident, and the
criterion function serves to meet this need. A wide variety
of such functions exists, and the choice is usually determined
by the particular characteristics of the research being
conducted. A more detailed discussion of optimality criteria
is presented in Section II.C.

Mathematical  clustering  techniques  usually  call  for  a 
concept  of  distance  between  objects.   In  order  to  solve  the 
cluster problem, it is desirable to define the terms "similarity"
and "difference" in a quantitative fashion. What
does  it  mean  to  say  two  objects  are  different?   Perhaps  an 
investigator  would  assign  two  observations  to  the  same  group 
if  the  distance  between  them  is  sufficiently  small,  or  to 
different  clusters  if  this  distance  is  sufficiently  large. 
Common  reference  to  the  closeness  of  objects  is  made  in 
units of length, weight, or time. Numerous methods for
measuring distance will be discussed in Section II.D.

In the following discussion, $X_i$ and $X_j$ represent two
points in p-dimensional Euclidean space ($E_p$) corresponding
to objects or observations. Any non-negative real-valued
function $D(X_i, X_j)$ satisfying the following conditions
qualifies as a distance function (or metric).

a.  $D(X_i, X_j) \ge 0$ for all $X_i$ and $X_j$ in $E_p$

b.  $D(X_i, X_j) = 0$ if and only if $X_i = X_j$

c.  $D(X_i, X_j) = D(X_j, X_i)$

d.  $D(X_i, X_j) \le D(X_i, X_k) + D(X_k, X_j)$

where $X_i$, $X_j$, and $X_k$ are any three points in $E_p$. Later
discussions will place particular emphasis on the Mahalanobis
metric.
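
The four conditions can be checked numerically on a finite set of points. The sketch below is illustrative only; the helper `satisfies_metric_axioms` and the sample points are hypothetical. It also shows that a squared Euclidean distance, convenient as it is for computation, fails condition (d).

```python
import itertools
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def satisfies_metric_axioms(dist, points, tol=1e-12):
    """Check conditions a-d on every pair and triple of the given points."""
    for x, y in itertools.product(points, repeat=2):
        d = dist(x, y)
        if d < 0:                           # condition a
            return False
        if (d == 0) != (x == y):            # condition b
            return False
        if abs(d - dist(y, x)) > tol:       # condition c
            return False
    for x, y, z in itertools.product(points, repeat=3):
        if dist(x, y) > dist(x, z) + dist(z, y) + tol:   # condition d
            return False
    return True


points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 4.0)]
print(satisfies_metric_axioms(euclidean, points))
print(satisfies_metric_axioms(squared_euclidean, points))
```

A check on sample points can only refute the axioms, never prove them, so this is a sanity test rather than a demonstration.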

The  use  of  cluster  analysis  is  applicable  in  nearly 
every  field  of  study.   The  literature  is  both  voluminous  and 
diverse,  the  terminology  differing  from  one  field  to  another. 
"Numerical  taxonomy"  is  frequently  substituted  for  cluster 
analysis  among  biologists,  botanists,  and  ecologists,  while 
some social scientists may prefer "typology." Other frequently
encountered terms are pattern recognition and partitioning.
While discriminant analysis has been studied by
statisticians  for  nearly  45  years,  cluster  analysis  has  only 
recently  come  to  statistical  notice. 

Cluster analysis is an exploratory device, a tool for
suggestion and discovery. A question often asked is "How do
you know when you have a good set of clusters?" The answer
is that the clusters themselves are not interesting; the
point of interest is in inference about the structure of
the data. The clusters do not explain the structure; they
are consequences of the structure. The explanatory structure
is the object of the search and its description is
in terms of principles and ideas, not individual data units.

It  is  important  to  realize  that  a  given  set  of  data 
may  contain  no  "right"  classification,  but  possibly  many 
different,  meaningful  classifications.   It  could  be  the 
case  that  the  data  contain  no  clusters  at  all. 

B.   SCATTER  MATRIX  DECOMPOSITION 

Described  in  this  section  are  the  multivariate  terminology 
and notation to be used in this thesis. The literature
contains  as  many  different  notational  structures  as  authors. 
The  emphasis  is  on  simplicity,  while  also  exposing  the  reader 
to  some  of  the  more  common  terminology. 

In  general,  multivariate  data  are  viewed  as  an  n  by  p 
matrix  referred  to  as  X.   It  represents  n  observations,  each 
of  which  consists  of  measurements  on  p  different  variables. 
The  cross  products  matrix  is  analogous  to  the  univariate  sum 
of  squared  deviations  from  the  mean  and  is  represented  by 
the p by p matrix T.

$$ T = \sum_{i=1}^{g} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{..})(x_{ij} - \bar{x}_{..})' $$

where

$x_{ij}$ is the j-th observation vector in the i-th group.

$\bar{x}_{..}$ is the grand mean vector of the data.

g is the number of groups.

$n_i$ is the number of observations in the i-th group.

Prime notation indicates transpose.

All vectors are column vectors. Cross product matrices
are also referred to as scatter matrices. Division of T
by n-1 (where n represents the total number of observations)
yields the total variance-covariance matrix, sometimes referred
to as a dispersion matrix.

The  total  sum  of  squares  (cross  products)  matrix  may 
be  expressed  as  the  sum  of  the  within-group  and  the  between- 
group  scatter  matrices: 

T   =   W  +  B 

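W and B are defined as follows:

$$ W = \sum_{i=1}^{g} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{i.})(x_{ij} - \bar{x}_{i.})' $$

$$ B = \sum_{i=1}^{g} n_i (\bar{x}_{i.} - \bar{x}_{..})(\bar{x}_{i.} - \bar{x}_{..})' $$
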
where $\bar{x}_{i.}$ is the mean vector of the i-th group. Each
individual group has its own scatter matrix $W_i$, and W is
the sum of these matrices:

$$ W = \sum_{i=1}^{g} W_i $$


This  discussion  is  intended  to  be  completely  general, 
with  no  particular  group  structure  in  mind.   Later  we 
shall  explore  the  differences  in  the  two  group  structures 
represented  by  the  SOF  data. 

(1)  Individual SOFs are considered to be the observations
and the courses are the groups.

(2)  The course mean vectors are the observations and
the clusters of courses are the groups.

These two group structures are different ways of viewing
the data; their relationship shall be explained in Section
IV.B.
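
The decomposition T = W + B described in this section can be verified numerically for any grouping. The following sketch uses plain Python lists and a small made-up data set of two groups measured on two variables; the helper names are hypothetical.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(r[k] for r in rows) / n for k in range(len(rows[0]))]


def scatter(rows, center):
    """Sum of outer products (x - center)(x - center)' over the rows."""
    p = len(center)
    s = [[0.0] * p for _ in range(p)]
    for x in rows:
        d = [a - m for a, m in zip(x, center)]
        for r in range(p):
            for c in range(p):
                s[r][c] += d[r] * d[c]
    return s


def add(a, b):
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


# Made-up observations: group 1 has three rows, group 2 has two.
groups = [
    [[1.0, 2.0], [2.0, 4.0], [3.0, 3.0]],
    [[6.0, 7.0], [8.0, 9.0]],
]
all_rows = [x for g in groups for x in g]
grand = mean_vector(all_rows)
p = len(grand)

T = scatter(all_rows, grand)
W = [[0.0] * p for _ in range(p)]
B = [[0.0] * p for _ in range(p)]
for g in groups:
    m = mean_vector(g)
    W = add(W, scatter(g, m))
    # n_i copies of the group mean contribute n_i (m - grand)(m - grand)'
    B = add(B, scatter([m] * len(g), grand))

max_gap = max(abs(T[r][c] - W[r][c] - B[r][c])
              for r in range(p) for c in range(p))
print(max_gap)
```

The same functions apply to either group structure listed above; only the meaning of a "row" and a "group" changes.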

C.   OPTIMALITY  CRITERIA

Most of the well known clustering techniques fall into
one of two main categories: (1) hierarchical and (2) partitioning.
The former class is one in which every cluster
obtained at any stage is a merger of clusters at previous
stages. The non-hierarchical procedures, however, form new
clusters by lumping and splitting old ones.

Partitioning methods were used in this research. The
main idea is to choose some initial partition and then alter
the cluster membership in an effort to improve the partition.
Different interpretations of what constitutes a "better"
partition and numerous ways of achieving this improvement
have led to a great variety of algorithms. These methods
are  related  to  the  steepest  descent  algorithms  used  for 
unconstrained  optimization  in  nonlinear  programming.   Such 
algorithms  begin  with  an  initial  point  and  then  converge 
to  a  local  optimum,  moving  one  step  at  a  time,  the  value  of 
the  objective  function  improving  at  each  step.   A  well  known 
example  is  the  ISODATA  procedure  developed  by  Ball  and  Hall 
at  Stanford  Research  Institute.   Chapter  IV  discusses  a 
partitioning  method  known  as  K-MEANS  which  was  developed 
by MacQueen [2]. He uses the term "K-MEANS" to denote the
process  of  assigning  each  data  unit  to  that  cluster  (of 
k  clusters)  with  the  nearest  centroid  (mean  vector).   The 
cluster  centroids  change  with  each  transfer  of  an  observation. 
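
The nearest-centroid reassignment just described can be sketched as follows. This is a minimal illustration of the general K-MEANS idea, not the MIKCA program: the function `k_means`, the rule that a cluster is never emptied, and the six sample points are all assumptions made for the example, and ordinary squared Euclidean distance is used.

```python
def sq_dist(x, c):
    return sum((a - b) ** 2 for a, b in zip(x, c))


def centroid(rows):
    return [sum(r[k] for r in rows) / len(rows) for k in range(len(rows[0]))]


def k_means(data, labels, k, max_passes=100):
    """Move each point to the cluster with the nearest centroid.

    Centroids are recomputed after every transfer, so a move made early in
    a pass affects the decisions made for later points in the same pass.
    """
    labels = list(labels)
    for _ in range(max_passes):
        moved = False
        for i, x in enumerate(data):
            cents = {}
            for g in range(k):
                members = [data[j] for j in range(len(data)) if labels[j] == g]
                if members:
                    cents[g] = centroid(members)
            best = min(cents, key=lambda g: sq_dist(x, cents[g]))
            strictly_closer = sq_dist(x, cents[best]) < sq_dist(x, cents[labels[i]])
            # Transfer only if it helps and does not empty the old cluster.
            if strictly_closer and labels.count(labels[i]) > 1:
                labels[i] = best
                moved = True
        if not moved:   # a full pass with no transfers: the partition is stable
            break
    return labels


data = [[1.0, 1.0], [1.5, 2.0], [1.0, 0.5], [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]]
start = [0, 1, 0, 1, 0, 1]          # deliberately poor initial partition
final = k_means(data, start, k=2)
print(final)
```

Recomputing every centroid for every point is wasteful but keeps the sketch short; a practical program would update the two affected centroids incrementally.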

The  decomposition  of  the  total  scatter  into  within 
and  between  components  suggests  possible  optimality  criteria 
to  be  used  in  a  clustering  algorithm.   One  would  like  the 
within-groups  scatter  to  be  small  relative  to  the  between- 
groups  scatter.   Various  trial  clusterings  could  be  formed 
using the W and B matrices as a basis for the optimality
criteria  which  determine  the  best  clustering.   A  possible 
choice  for  a  criterion  is  to  minimize  trace  W  over  all 
partitions into g groups. Since T is constant over all
partitions, minimizing trace W is equivalent to maximizing
trace B since

trace T   =   trace W  +  trace B

Although trace W is invariant under an orthogonal transformation,
it is not invariant under other non-singular linear
transformations.

McRae  [3]  points  out  that  trace  W  equals  the  total  within 
group  sum  of  squares,  hence  the  "minimum  variance  partition" 
cluster  solution  is  found  by  minimizing  trace  W. 

Considerable  study  has  been  devoted  to  alternative 
criteria  such  as  those  based  on  multivariate  statistical 
analysis  techniques,  especially  the  methods  of  linear 
discriminant  analysis  and  multivariate  analysis  of  variance. 
Assuming the p variables are not linearly dependent, then
as long as $p \le n-g$, W is positive definite symmetric and
so is $W^{-1}$. Attempts to make B and W as different as possible
lead one to solving the determinantal equation:

$$ |B - \lambda W| = 0 $$

The solutions $\lambda_i$ are the eigenvalues of the matrix $W^{-1}B$.
There are t non-zero eigenvalues, where t is the minimum
of p and g-1. This is a consequence of the fact that, if
g is less than p, the g group means are contained in a
(g-1)-dimensional hyperplane. When g = 2 the analysis is
equivalent  to  two-group  discriminant  analysis.   Linear 
discriminant  analysis  would  take  the  vectors  originally 
described in a p-dimensional coordinate system and transform
the basis to a t-dimensional system. Maximizing the
largest of these eigenvalues is a criterion suggested by
S.N. Roy. Maximizing the trace of $W^{-1}B$, however, is a
criterion known as Hotelling's trace criterion. In both
cases, large values for these statistics are sought in
clustering algorithms since large values indicate large
differences among (between) groups. Minimizing the ratio
of determinants $|W| / |T|$ is a criterion widely known as
Wilks' lambda. Since T is the same for all partitions,
this criterion is equivalent to minimizing det W.

Both trace $W^{-1}B$ and $|T| / |W|$ may be expressed in terms
of the eigenvalues of $W^{-1}B$.

$$ \frac{|T|}{|W|} = \prod_{i=1}^{t} (1 + \lambda_i) $$

$$ \operatorname{trace} W^{-1}B = \sum_{i=1}^{t} \lambda_i $$

where t = min(p,g-1). Therefore minimizing det W is
equivalent to maximizing $\prod (1 + \lambda_i)$.
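
The product form follows from T = W + B in a few steps. The derivation below is a short sketch that assumes W is non-singular, as stated above, and writes I for the p by p identity matrix.

```latex
\begin{aligned}
|T| &= |W + B| = |W (I + W^{-1}B)| = |W| \, |I + W^{-1}B| \\
\frac{|T|}{|W|} &= |I + W^{-1}B|
  = \prod_{i=1}^{p} (1 + \lambda_i)
  = \prod_{i=1}^{t} (1 + \lambda_i)
\end{aligned}
```

The last equality holds because the remaining p - t eigenvalues of $W^{-1}B$ are zero and each contributes a factor of one.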

Friedman  and  Rubin  [4]  describe  the  advantages  of  the 
various  criteria.   Those  based  on  multivariate  statistical 
considerations (all but trace W) are invariant under changes
in scale for the variables (non-singular linear transformation).
In fact, they are the only invariants for W and B under such
transformations. In addition, the multivariate criteria
may take into account covariation among the variables.
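
The practical consequence of this invariance can be seen on a tiny example. The sketch below uses four made-up points and two candidate partitions; all names are hypothetical and the determinant helper handles only the two-variable case. Rescaling the second variable reverses the ranking given by trace W, while the ratio |W| / |T| for each partition is unchanged.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(r[k] for r in rows) / n for k in range(len(rows[0]))]


def scatter(rows, center):
    p = len(center)
    s = [[0.0] * p for _ in range(p)]
    for x in rows:
        d = [a - m for a, m in zip(x, center)]
        for r in range(p):
            for c in range(p):
                s[r][c] += d[r] * d[c]
    return s


def within_scatter(groups):
    p = len(groups[0][0])
    w = [[0.0] * p for _ in range(p)]
    for g in groups:
        s = scatter(g, mean_vector(g))
        w = [[w[r][c] + s[r][c] for c in range(p)] for r in range(p)]
    return w


def total_scatter(groups):
    rows = [x for g in groups for x in g]
    return scatter(rows, mean_vector(rows))


def trace(a):
    return sum(a[k][k] for k in range(len(a)))


def det2(a):
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]


def wilks_ratio(groups):
    return det2(within_scatter(groups)) / det2(total_scatter(groups))


def rescale(groups, factors):
    return [[[v * f for v, f in zip(x, factors)] for x in g] for g in groups]


a, b, c, d = [0.0, 0.0], [1.0, 3.0], [4.0, 1.0], [4.0, 3.0]
split_on_x = [[a, b], [c, d]]
split_on_y = [[a, c], [b, d]]

for label, factors in (("original", [1.0, 1.0]), ("y times 3", [1.0, 3.0])):
    gx, gy = rescale(split_on_x, factors), rescale(split_on_y, factors)
    print(label,
          trace(within_scatter(gx)), trace(within_scatter(gy)),
          wilks_ratio(gx), wilks_ratio(gy))
```

On the original scale trace W prefers the split on x; after the second variable is multiplied by three it prefers the split on y, although nothing about the data has changed except the unit of measurement.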

D.   DISTANCE  CONSIDERATIONS 

As  indicated  earlier  there  exist  a  number  of  choices 
for  measuring  distance  between  objects.   The  choice  of 
distance function is no less important than the choice of
variables to be used in the study. A serious difficulty
lies in the fact that knowledge of the clusters changes the
choice of distance functions. In the computation of the
distance,  a  variable  which  distinguishes  well  between  two 
established  clusters  might  be  weighted  more  heavily  than 
others.   Friedman  and  Rubin  describe  this  difficulty  as 
the  "bootstrap"  nature  of  the  problem.   Knowledge  of  the 
clusters  would  suggest  an  appropriate  distance  function 
which  in  turn  would  allow  one  to  determine  the  original 
clusters.   The  trace  W  criterion  implies  ordinary  Euclidean 
distance and thus hides this circularity. Use of the criteria
which are invariant under non-singular linear transformations
deals effectively with this circularity.

The  familiar  Euclidean  distance  is  illustrated  in 
figure 1a. When p = 2 the geometric interpretation of this
measure  amounts  to  determining  distances  by  circles.   Two 
points  such  as  A  and  B  on  the  same  circle  are  considered 
equidistant  from  the  origin,  while  other  points  such  as 
C  and  D  are  further  from  the  origin  than  A  and  B. 

A general class of squared distance functions is provided
by utilizing positive definite quadratic forms. Specifically,
if $\beta$ represents a p-dimensional observation to be assigned
to one of g groups, then to measure the squared distance
between $\beta$ and the centroid of the i-th group one may
consider the function

$$ D_i = (\beta - \bar{x}_{i.})^T M (\beta - \bar{x}_{i.}) \qquad (1) $$

where M is a positive definite matrix to ensure that
$D_i \ge 0$. Different metrics are represented by different
choices  of  the  matrix  M.   When  M  =  I  (the  identity  matrix) 
the  resulting  metric  is  the  standard  Euclidean  distance. 
The  variance  within  the  data  may  make  the  unweighted 
Euclidean metric inappropriate. Referring to figure 1b
where  x  has  a  larger  variance  than  y,  one  may  wish  to  weight 
a  deviation  in  the  x  direction  less  than  an  equal  deviation 
in  the  y  direction.   A  method  for  accomplishing  this  is 
through  use  of  an  elliptical  (weighted  Euclidean)  distance 
function  which  makes  points  A  and  B  equidistant  from  the 
origin.   The  matrix  M  in  this  case  is  diagonal  with  diagonal 
elements  equal  to  the  reciprocals  of  the  variances  of  the 
different  variables.   Insofar  as  the  variance  represents  the 
true  structure  in  the  data,  this  distance  function  will 
adjust  for  differences  due  to  the  scale  of  measurement  of 
each  of  the  variables.   Extending  this  idea  further,  one 
may consider the covariance among variables as well. Figure
1c shows how the axes may be tilted so that the major axis
is oriented in a direction reflecting the positive
correlation between x and y. Again, points on the same
ellipse are considered equidistant from the origin. The
matrix M in this case is the inverse of the covariance
matrix.
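
The three choices of M discussed above can be compared on a small example. The sketch below is illustrative only: the covariance matrix, the points, and the helper names are made up, the centroid is placed at the origin, and the matrix inverse is written out for the two-variable case.

```python
def quadratic_distance(beta, centroid, M):
    """Squared distance (beta - centroid)' M (beta - centroid)."""
    d = [b - c for b, c in zip(beta, centroid)]
    p = len(d)
    return sum(d[r] * M[r][s] * d[s] for r in range(p) for s in range(p))


def inverse_2x2(a):
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]


# Hypothetical covariance: x has variance 4, y has variance 1, and the
# two variables are positively correlated.
cov = [[4.0, 1.5], [1.5, 1.0]]
origin = [0.0, 0.0]

identity = [[1.0, 0.0], [0.0, 1.0]]                     # Euclidean
inv_var = [[1.0 / cov[0][0], 0.0], [0.0, 1.0 / cov[1][1]]]  # weighted Euclidean
inv_cov = inverse_2x2(cov)                              # full covariance

pt_a, pt_b = [2.0, 0.0], [0.0, 1.0]     # one standard deviation along each axis
pt_p, pt_q = [2.0, 1.0], [2.0, -1.0]    # along and against the correlation

for name, M in (("euclidean", identity), ("weighted", inv_var), ("inv cov", inv_cov)):
    print(name, [quadratic_distance(x, origin, M) for x in (pt_a, pt_b, pt_p, pt_q)])
```

Under the identity matrix the point on the x axis is four times as far (in squared distance) as the point on the y axis; the diagonal weighting makes them equidistant; and only the full inverse covariance distinguishes the two points that lie along and against the direction of correlation.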

Further  examination  of  this  concept  is  an  important 
consideration in this research. If $C_i$ represents the covariance
matrix of the i-th cluster, then the distance
function

$$ D_i = (\beta - \bar{x}_{i.})^T C_i^{-1} (\beta - \bar{x}_{i.}) $$

uses the appropriate covariance structure when determining
distance to a particular cluster centroid. Note that the
number of observations in every cluster must exceed the
dimensionality p in order to preserve the nonsingularity
of $C_i$. Since $C_i$ changes to reflect the dispersion internal
to each particular cluster, the use of this metric exploits
differences in the dispersion characteristics of the
different groups. Figure 1d illustrates the idea. Note how
a new observation (denoted by u) is closer to the centroid
of group one (G1) in terms of Euclidean distance but is
more likely to be assigned to group two (G2) when using the
$C_i$ matrix. It is instructive to point out here that if one
were  looking  for  boundaries  dividing  the  p-dimensional 
space  into  regions,  one  for  each  of  the  g  groups,  such 
boundaries  would  be  non-linear.   In  the  performance  of 
discriminant  analysis,  Eisenbeis  [5]  suggests  appropriate 
quadratic  classification  rules. 
Another choice for the M matrix in equation 1 is
$C^{-1}$, where C represents the pooled within-groups covariance
matrix of all the clusters.

$$ C = \frac{W}{\sum_{i=1}^{g} (n_i - 1)} $$

Recall from Section II.B:

$$ W = \sum_{k=1}^{g} W_k $$

This distance is the well known Mahalanobis distance.
Note that C does not change from group to group. To ensure
the non-singularity of C it must be true that $p \le (n-g)$,
where

$$ n = \sum_{i=1}^{g} n_i $$

n represents the total number of observations over all
groups.

The use of the Mahalanobis metric in the original p-
dimensional space is equivalent to using Euclidean distance
in the t-dimensional discriminant space with basis vectors
corresponding to the eigenvectors of $W^{-1}B$. Note that the
determination of the discriminant space was based on the
assumption of homogeneity of the cluster covariance structure.
The Mahalanobis distance function therefore adjusts
for both scale of measurement of the variables and covariation
among the variables. Use of this metric is equivalent
to computing distances on variables transformed to their
principal components, each scaled to unit variance.

The  natural  metric  to  use  with  the  trace  W  criterion  is 
the  Euclidean  distance.   However,  when  using  criteria  based 
on  multivariate  statistical  considerations,  Mahalanobis  is 
the  natural  metric  to  use. 

When  the  clusters  are  distributed  as  p-variate  normal 
and  have  equal  covariance  matrices,  then  Fisher's  linear 
discriminant  function  is  applicable,  as  is  the  Mahalanobis 
distance. The accuracy of the Mahalanobis metric is sensitive
to the homogeneity of the cluster dispersions and
decreases as the difference between the group dispersions
increases. Recall the density function for the multivariate
normal distribution

$$ f(x) = (2\pi)^{-p/2} |\Sigma|^{-1/2} \exp\left\{ -\tfrac{1}{2} (x-\mu)' \Sigma^{-1} (x-\mu) \right\} $$

where $\Sigma$ is the covariance matrix and $\mu$ is the mean vector
of the distribution. The exponent is a multiple of the squared
Mahalanobis distance from x to $\mu$, so utilization of
Mahalanobis distance is equivalent to
measurement of the density at the point x. The empirical
distributions of the clusters will therefore determine the
cluster to which the observation should be assigned. The
following is a proof of the invariance of Mahalanobis
distance under any non-singular linear transformation.
Consider the transformation

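$$ Y = BX $$

where B here denotes an arbitrary non-singular p by p matrix
(not the between-group scatter matrix), and let $D(Y_i, Y_j)$
represent Mahalanobis distance between $Y_i$ and $Y_j$.

$$ \begin{aligned}
D(Y_i, Y_j) &= (Y_i - Y_j)^T C_Y^{-1} (Y_i - Y_j) \\
            &= (BX_i - BX_j)^T C_Y^{-1} (BX_i - BX_j) \\
            &= (X_i - X_j)^T B^T C_Y^{-1} B (X_i - X_j) \\
            &= (X_i - X_j)^T B^T (B C_X B^T)^{-1} B (X_i - X_j) \\
            &= (X_i - X_j)^T C_X^{-1} (X_i - X_j) \\
            &= D(X_i, X_j)
\end{aligned} $$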
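
The invariance proved above can also be confirmed numerically. The sketch below is a minimal check on made-up data: for brevity it uses the ordinary sample covariance of five two-variable observations in place of the pooled within-groups matrix, and the transformation matrix and helper names are hypothetical.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(r[k] for r in rows) / n for k in range(len(rows[0]))]


def covariance(rows):
    """Sample covariance matrix (divisor n - 1)."""
    m = mean_vector(rows)
    p, n = len(m), len(rows)
    return [[sum((x[r] - m[r]) * (x[c] - m[c]) for x in rows) / (n - 1)
             for c in range(p)] for r in range(p)]


def inverse_2x2(a):
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]


def transform(b, x):
    """Apply the linear map Y = BX to one observation."""
    return [sum(b[r][k] * x[k] for k in range(len(x))) for r in range(len(b))]


def mahalanobis_sq(x, y, c_inv):
    d = [u - v for u, v in zip(x, y)]
    p = len(d)
    return sum(d[r] * c_inv[r][s] * d[s] for r in range(p) for s in range(p))


xs = [[1.0, 2.0], [2.0, 1.0], [4.0, 5.0], [6.0, 4.0], [7.0, 8.0]]
b = [[2.0, 1.0], [0.0, 3.0]]            # non-singular: determinant is 6
ys = [transform(b, x) for x in xs]

cx_inv = inverse_2x2(covariance(xs))
cy_inv = inverse_2x2(covariance(ys))

gaps = [abs(mahalanobis_sq(xs[i], xs[j], cx_inv)
            - mahalanobis_sq(ys[i], ys[j], cy_inv))
        for i in range(len(xs)) for j in range(len(xs))]
print(max(gaps))
```

Every pairwise distance agrees to within rounding error before and after the transformation, whereas the corresponding Euclidean distances would not.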


Some  other  common  metrics  are  defined  below. 
1.   $L_1$ norm (city block)

$$ D(X_i, X_j) = \sum_{k=1}^{p} |X_{ki} - X_{kj}| $$

2.   $L_p$ norm (Minkowski metrics)

$$ D(X_i, X_j) = \left\{ \sum_{k=1}^{p} |X_{ki} - X_{kj}|^p \right\}^{1/p} $$

3.   Uniform norm

$$ D(X_i, X_j) = \sup_k |X_{ki} - X_{kj}| $$
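
These three metrics are simple to state in code. In the sketch below the order of the Minkowski norm is passed as a separate argument `r` to keep it distinct from the number of variables; that name, like the function names, is a choice made for this example.

```python
def city_block(x, y):
    """L1 norm: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def minkowski(x, y, r):
    """Minkowski metric of order r (r = 2 gives Euclidean distance)."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)


def uniform(x, y):
    """Uniform norm: largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))


x, y = (0.0, 0.0), (3.0, 4.0)
print(city_block(x, y), minkowski(x, y, 2), uniform(x, y))
```

Order 1 reproduces the city block distance, and as the order grows the Minkowski distance approaches the uniform norm.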
III.   THE  DATA  SET

A.   ORIGIN 

The  present  Student  Opinion  Form  (SOF)  system  was 
started  in  the  summer  quarter  of  1975  when  it  replaced  the 
Student Instruction Report (SIR) obtained from the Educational
Testing Service at Princeton. The SOF form has 16
questions and space for free-form comments from the students.
The information obtained from the SOF data is used for the
twofold purpose mentioned in Section I.A of this paper.

An SOF form (figure 2) should be completed by each student
for each course segment he takes for credit. The term
"course segment" is used because the same course may be
offered to more than one group of students. To differentiate
between the classes, segment numbers are assigned and
a separate SOF identification number exists for each segment.
Different segments of the same course may or may not be
taught by the same professor. About 20 percent of the forms
are not returned to administration officials due to lack of
interest on the part of some students and instructors.
Students have been informed that the results of the SOF
data are used to assist in identifying faculty members for
pay raises and tenure considerations.

Difficulties with legibility of the completed forms and
with the OpScan machine have persisted for several quarters.
The data available for this research have been coded with
indications where invalid responses occur. Only the valid
information was considered in this thesis. Mean scores
were computed for every instructor (every course segment)
from the valid responses in each of the first 13 SOF items.
Only the first 13 questions were used because of the high
percentage of unusable responses in items 14, 15, and 16.
Each of the responses recorded is an integer from one to
five, with five being the upper (more desirable) end of the
scale. These data are therefore considered on an ordinal
scale. Table 1 categorizes the blocks of data which were
available for this study. Note the short 3-digit notation
to be used in this paper, indicating calendar year and
quarter number.


CALENDAR YEAR     NUMBER OF RESPONDENTS     3-DIGIT CODE

Summer 1977               2440                  773
Fall   1977               2967                  774
Winter 1978               3056                  781
Spring 1978               2964                  782

                        Table 1

The majority of the analysis was performed using only
quarter 773. Unless otherwise indicated, future references
to the data set shall imply quarter 773 data.

B.   TRANSFORMATION

The need for a common covariance structure when using
the Mahalanobis metric has been emphasized. The transformation
of quarter 773 data (which attempted to accomplish
homogeneity of dispersions) is explained in this section.

The SOF data are 13-dimensional, and the best transformations
would involve separate examination of each of the
13 variables. Due to the overwhelming complexity of this
task, only a single transformation was sought.

In the SOF data the variance is very much a function
of the mean. In fact, a course with a 5.0 mean vector has
no variance whatsoever. Similar effects occur on the lower
end of the scale. A variance-stabilizing transformation
was sought which would help to relieve the dependence of the
variance on the mean. Recall that the normal distribution has
independent mean and variance. Other well-known distributions,
such as the Exponential, Geometric, and Poisson, all
have related mean and variance. The assumption of multivariate
normality underlies much of standard classical multivariate
statistical methodology. The effects of departure
from normality are not clearly understood. Although marginal
normality does not imply joint normality, the presence of
many types of non-normality is often reflected in the marginal
distributions as well. The marginal distributions of the
SOF data do not indicate any strong departures from normality.

Previous research by Professor R.R. Read [7] encountered
the same need for a transformation of the SOF data. The
following transformation is due to Professor Read's
findings:


In  ("cT )      where   a  =  .2  (1) 


The transformation was used on SOF item 12, and Bartlett's
test substantiated the presence of homogeneity of variances.
The groups involved here were the course segments, and the
application was univariate.

Studies  by  Professor  Glen  Lindsay  [8]  and  students  in 
his  course  on  Scaling  Techniques  produced  results  which 
suggested  slight  modifications  to  Professor  Read's  transfor- 
mation. 


1^  (#T^^)       where   a  =  2.0  (2 

5+b-x 

b  =  0.3 


The same study could be described equally well with a constant
second difference model or, what is the same thing,
the function


$$x^2 + c \qquad (3)$$


The three transformations were considered in the following
manner. It was felt that the transformation which would
produce the most nearly homogeneous covariance structure
would be best. The three functions were applied to quarter
773 data, and then statistical tests for common covariance
were administered. The test statistic comes from reference [5]
and is explained in Appendix A. The results indicated that,
of the three, the first log transformation (1) generated
the most nearly common covariance structure. The group
structure whose covariance matrices were compared came from
clusters formed by the MIKCA algorithm (to be discussed in
the next chapter). On the basis of the test results, the
data were transformed by function (1), and all subsequent
references to the data shall imply the transformed data.

Functions  (1)  and  (2)  are  shown  together  on  the  graph 
in  figure  3.   The  one  chosen  for  use  is  the  lower  curve. 

C.   PRINCIPAL  COMPONENTS 

Recall the breakdown of the cross products matrix into
the sum of the within and between scatter matrices. When
considering the observations as individual SOFs (and the
groups as courses), the cross products matrix will be called
the master scatter matrix with decomposition:

M   =   S  +  T

where S is the within course scatter and T is the between
course scatter. It is reemphasized that, in this equation,
the groups are the courses. The breakdown of the master
scatter matrix may be examined before any clustering of
course means is performed because the group structure
(courses are groups) is known. The discussion is enhanced
by an algebraic description of the matrices involved. Let


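$$M = \sum_{i=1}^{N} \sum_{j=1}^{ns_i} (X_{ij} - \bar{X}_{\cdot\cdot})(X_{ij} - \bar{X}_{\cdot\cdot})'$$

$$S = \sum_{i=1}^{N} \sum_{j=1}^{ns_i} (X_{ij} - \bar{X}_{i\cdot})(X_{ij} - \bar{X}_{i\cdot})'$$

$$T = \sum_{i=1}^{N} ns_i \, (\bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot})(\bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot})'$$

where

$X_{ij}$ is the j-th SOF response form from the i-th course.

$\bar{X}_{i\cdot}$ is the mean vector of the i-th course.

$\bar{X}_{\cdot\cdot}$ is the grand mean.

$ns_i$ is the number of students in the i-th course.

$N$ is the total number of courses.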
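The decomposition M = S + T can be verified directly from these definitions. The sketch below uses three small made-up "courses" with two items each (assumed illustration values, not SOF data); the function names are hypothetical.

```python
# Sketch: master scatter M, within-course scatter S, between-course scatter T.
# The responses below are assumed illustration values, not SOF data.

def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def add_outer(matrix, dev, weight=1.0):
    """Add weight * dev dev' to matrix in place."""
    for r in range(len(dev)):
        for c in range(len(dev)):
            matrix[r][c] += weight * dev[r] * dev[c]

def scatter(courses):
    """Return (M, S, T) for a list of courses, each a list of response vectors."""
    p = len(courses[0][0])
    M, S, T = ([[0.0] * p for _ in range(p)] for _ in range(3))
    grand = mean([x for course in courses for x in course])
    for course in courses:
        course_mean = mean(course)
        # between-course term, weighted by the number of students ns_i
        add_outer(T, [a - b for a, b in zip(course_mean, grand)], len(course))
        for x in course:
            add_outer(M, [a - b for a, b in zip(x, grand)])
            add_outer(S, [a - b for a, b in zip(x, course_mean)])
    return M, S, T

courses = [
    [[5, 4], [4, 4], [5, 5]],
    [[2, 3], [3, 3]],
    [[4, 2], [3, 4], [4, 4], [5, 3]],
]
M, S, T = scatter(courses)
```

Each entry of `M` equals the corresponding entry of `S` plus that of `T`, up to rounding.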


Since T represents the dispersion of the course means, it
is the main object of the clustering efforts. It is natural
to ask also how much information is in S. To this end a
principal components analysis was performed on the covariance
matrices:

$$C_T = \frac{1}{N-1}\, T \qquad\qquad C_S = \frac{1}{\sum_{i=1}^{N} (ns_i - 1)}\, S$$

where, for quarter 773 data,

$$N - 1 = 189 \qquad\qquad \sum_{i=1}^{N} (ns_i - 1) = 1993$$


Anderson [6] describes principal components as the axes of
a coordinate system with special statistical properties.
The principal components form a new coordinate system
resulting from linear transformations of the variables
which produce the special properties in terms of variances.
The idea is to describe the data swarm by a new set of
orthogonal coordinates so that the sample variances with
respect to the new coordinates are in decreasing order.
If the eigenvalues of the covariance matrix are ordered,
i.e., $\lambda_1 > \lambda_2 > \ldots > \lambda_p$, then the variance in the new
coordinate system is greatest in the dimension associated
with $\lambda_1$, next greatest in the dimension associated with $\lambda_2$,
etc. The sum of the eigenvalues is the total variance in
the original coordinate system.
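The properties just described can be illustrated for two variables, where the eigenvalues of a symmetric covariance matrix have a closed form. This is only a sketch: the five score pairs are made-up illustration values, and the 13-variable analysis in the thesis would need a general eigenvalue routine.

```python
import math

# Sketch: principal components of a 2-variable covariance matrix.
# The score pairs are assumed illustration values, not SOF data.

def covariance(rows):
    """Sample covariance matrix (divisor n - 1) of a list of observations."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    p = len(means)
    return [[sum((r[a] - means[a]) * (r[b] - means[b]) for r in rows) / (n - 1)
             for b in range(p)] for a in range(p)]

def principal_components_2d(cov):
    """Eigenvalues (descending) and unit eigenvectors of a symmetric 2x2 matrix."""
    a, b, d = cov[0][0], cov[0][1], cov[1][1]
    half_trace = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    lam1, lam2 = half_trace + radius, half_trace - radius
    if b == 0.0:  # matrix is already diagonal
        v1 = [1.0, 0.0] if a >= d else [0.0, 1.0]
    else:
        norm = math.hypot(b, lam1 - a)
        v1 = [b / norm, (lam1 - a) / norm]
    v2 = [-v1[1], v1[0]]  # orthogonal to v1
    return (lam1, lam2), (v1, v2)

rows = [[4.5, 4.4], [3.9, 4.1], [3.0, 3.2], [4.8, 4.6], [2.5, 2.9]]
cov = covariance(rows)
eigenvalues, eigenvectors = principal_components_2d(cov)
share = eigenvalues[0] / sum(eigenvalues)  # variance share of first component
```

The eigenvalues sum to the trace of the covariance matrix, so `share` is the proportion of total variance carried by the first component.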

The  results  of  the  principal  components  analysis  are 
shown  in  Table  2.   First,  it  is  of  interest  to  compute  how 
much  of  the  total  energy  in  M  is  accounted  for  by  T. 

TOTAL   =   18.8(1993)  +  156(189)   =   66952

                 PRINCIPAL COMPONENTS ANALYSIS

         C_T            C_T EIGENVECTOR    C_S            C_S EIGENVECTOR
         EIGENVALUES    FOR λ_2            EIGENVALUES    FOR λ_13

  1         0.44             0.28              0.35            -0.27
  2       136.97             0.30              0.47            -0.30
  3         4.64             0.28              0.49            -0.28
  4         0.46             0.29              0.51            -0.29
  5         3.66             0.23              0.52            -0.19
  6         2.57             0.19              0.63            -0.19
  7         1.14             0.25              0.63            -0.24
  8         1.63             0.27              0.68            -0.29
  9         1.47             0.32              0.77            -0.33
 10         0.66             0.30              0.89            -0.33
 11         1.23             0.26              1.00            -0.28
 12         0.96             0.35              1.10            -0.30
 13         0.84             0.26             11.00            -0.27

TOTAL     156                                 18.8

                           TABLE 2

T accounts for 29484 ÷ 66952 = 44 percent of the total.
This indicates that a great deal of variability must therefore
be accounted for within the courses (i.e., with the
students).
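The arithmetic behind the 44 percent figure can be written out as follows. This sketch reads the two TOTAL entries of Table 2 as the sums of the eigenvalues, and hence the traces, of the two covariance matrices (an interpretation, although it matches the totals printed in the table); multiplying each by its divisor recovers the trace of the corresponding scatter matrix.

```latex
\begin{aligned}
\operatorname{tr}(S) &= 18.8 \times 1993 \approx 37468 \\
\operatorname{tr}(T) &= 156 \times 189 = 29484 \\
\frac{\operatorname{tr}(T)}{\operatorname{tr}(S) + \operatorname{tr}(T)}
  &\approx \frac{29484}{66952} \approx 0.44
\end{aligned}
```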

The principal components analysis of $C_S$ shows the first
principal component accounts for 55 percent of its total
variance, but all other coordinate directions each account
for 6 percent or less. Moreover, the direction of the first
component is essentially the main diagonal of 13 space,
i.e., the signs are all the same and so are the magnitudes
(approximately). Thus the data swarm may be thought of
as an elongated ellipsoid directed along the main diagonal
and having spheroidal (more or less) cross section. In
particular, this suggests that the students within a course
tend to score all 13 components more or less the same (all
high, all moderate, or all low), but perceptions from student
to student differ.

Turning to the principal components analysis of $C_T$, it
is seen that 85 percent of the total variability is accounted
for by the first principal component, and the second accounts
for only three percent. Thus the data swarm of course means
may be viewed as essentially one dimensional. Reference to
its eigenvector reveals no single SOF item or group of SOF
items is heavily weighted relative to the others and that the
signs are again all the same. Thus, this component is
similarly shaped along the main diagonal of 13 space, but
more extremely elongated.

Some  exploratory  work  was  done  on  the  within  class 
variability  (S)  to  see  if  the  "number  of  quarters  completed" 
by  students  has  any  effect  on  the  variability  represented 
by  S.   Figure  four  presents  the  results  with  a  graph  plotting 
within  course  variance  versus  time  on  board.   Note  the 
tendency  for  the  variability  to  drop  off  in  later  quarters, 
possibly indicating more perfunctory completion of the forms.

IV.   THE MIKCA METHOD

A.   THE  ALGORITHM 

The  specific  algorithm  chosen  for  the  cluster  analysis 
is  the  MIKCA  (Multivariate  Iterative  K-MEANS  Clustering 
Algorithm)  program  written  by  Douglas  J.  McRae  as  a  part 
of  his  doctoral  dissertation  at  the  University  of  North 
Carolina,  Chapel  Hill. 

Reference  to  the  flow  chart  in  figure  5  will  aid  the 
reader  in  the  following  discussion  of  the  algorithm.   Inputs 
to  the  program  are  the  data  matrix,  an  estimate  for  g  (the 
number  of  clusters) ,  and  choice  of  criterion  and  distance 
functions . 

In the first step, preliminary calculations are made,
such as the variable means and standard deviations, as well
as the cross products matrix T. The next step forms the
initial cluster solution. A random choice of g observations
serves as the initial cluster centers. Then each of the
other observations is assigned to the nearest cluster.
Euclidean distance is used for this initial phase, and the
cluster centroids are recomputed after each observation is
assigned to a group. The observations are considered in the
same order as they were input. After all of them have been
assigned to clusters, the criterion value is computed. This
initial cluster-finding technique is referred to as a one-pass
K-MEANS procedure. It is performed three times, and
the solution which yields the best criterion value is chosen
as the initial cluster solution.
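The initial phase described above can be sketched as follows. This is not McRae's FORTRAN code; it is a minimal illustration that assumes the minimum trace W criterion, and the function names and the fixed random seed are hypothetical.

```python
import random

# Sketch of the one-pass K-MEANS start, assuming the trace W criterion.

def sq_euclid(x, c):
    return sum((a - b) ** 2 for a, b in zip(x, c))

def trace_w(data, labels, g):
    """Trace of the within-cluster scatter matrix W."""
    total = 0.0
    for s in range(g):
        members = [x for x, lab in zip(data, labels) if lab == s]
        centroid = [sum(col) / len(members) for col in zip(*members)]
        total += sum(sq_euclid(x, centroid) for x in members)
    return total

def one_pass(data, g, rng):
    """Seed g clusters at random, then assign the rest in input order."""
    seeds = rng.sample(range(len(data)), g)
    centroids = [list(data[i]) for i in seeds]
    counts = [1] * g
    labels = [None] * len(data)
    for s, i in enumerate(seeds):
        labels[i] = s
    for i, x in enumerate(data):
        if labels[i] is not None:
            continue  # seed observations are already placed
        s = min(range(g), key=lambda c: sq_euclid(x, centroids[c]))
        labels[i] = s
        counts[s] += 1
        # recompute the centroid immediately (running mean update)
        centroids[s] = [m + (a - m) / counts[s] for m, a in zip(centroids[s], x)]
    return labels

def initial_solution(data, g, tries=3, seed=0):
    """Repeat the one-pass procedure and keep the best criterion value."""
    rng = random.Random(seed)
    best = None
    for _ in range(tries):
        labels = one_pass(data, g, rng)
        value = trace_w(data, labels, g)
        if best is None or value < best[0]:
            best = (value, labels)
    return best
```

Because each centroid is updated as soon as an observation joins it, the result depends on the input order, which is the point the text makes about the later phases as well.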

After the initial solution has been found, the program
advances to the iterative K-MEANS phase, where the observations
are again considered in the order in which they were
input to the program. It is in this phase that the user's
choice of distance function is used. The distance from each
observation to each cluster centroid is again computed, this
time with the user's distance function; the observation is
assigned to the closest centroid, and that centroid is updated
to reflect its new membership. After considering all n observations
in this manner, the new criterion value is checked
for possible improvement during the K-MEANS iteration. As
long as the criterion value improves, the K-MEANS procedure
is repeated; if the criterion fails to improve, then the MIKCA
algorithm goes to the next step, the individual switches
section.

Note  the  importance  of  the  order  of  consideration  of  the 
observations.   The  order  is  important  because  the  cluster 
means  are  recomputed  after  each  observation  is  reassigned. 

In the individual switches phase, consideration is given
to moving each observation to every other cluster, the move
being made if and only if an improvement in the value of the
criterion results. An elaborate labelling procedure provides
a unique order in which to consider each observation.
This procedure continues until a complete pass through the
data is made with no changes in cluster membership.
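The individual switches phase can be sketched in the same spirit. The version below is an illustration under stated assumptions: it uses the minimum trace W criterion, it visits observations in input order rather than by McRae's labelling procedure (which is not reproduced here), and it never empties a cluster.

```python
# Sketch of the individual switches phase, assuming the trace W criterion.

def trace_w(data, labels, g):
    """Trace of the within-cluster scatter matrix W."""
    total = 0.0
    for s in range(g):
        members = [x for x, lab in zip(data, labels) if lab == s]
        if not members:
            continue
        centroid = [sum(col) / len(members) for col in zip(*members)]
        total += sum((a - c) ** 2 for x in members for a, c in zip(x, centroid))
    return total

def individual_switches(data, labels, g):
    """Move single observations while any move improves the criterion."""
    labels = list(labels)
    best = trace_w(data, labels, g)
    changed = True
    while changed:  # stop after a full pass with no change in membership
        changed = False
        for i in range(len(data)):
            home = labels[i]
            if labels.count(home) == 1:
                continue  # assumption: a cluster is never emptied
            for s in range(g):
                if s == home:
                    continue
                labels[i] = s
                value = trace_w(data, labels, g)
                if value < best - 1e-12:
                    best, home, changed = value, s, True  # keep the move
                else:
                    labels[i] = home                      # undo the move
            labels[i] = home
    return labels, best
```

For example, starting four one-dimensional points 0, 1, 10, 11 from the poor labelling `[0, 1, 0, 1]`, two accepted moves separate the low pair from the high pair, after which a full pass produces no further change.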
The MIKCA algorithm provides the following options for
distance and criterion functions.

CRITERION

1.  Minimum trace W

2.  Minimum det W

3.  Maximum largest root of |B - λW| = 0

4.  Maximum sum of roots of |B - λW| = 0

DISTANCE

1.  Euclidean

2.  Weighted Euclidean

3.  Mahalanobis
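The four criterion options can be made concrete for the two-variable case, where the determinantal equation |B - λW| = 0 is a quadratic in λ. This is only a sketch with hypothetical function names; W is assumed to be positive definite, and the general p-variable case needs an eigenvalue routine.

```python
import math

# Sketch: the four criterion values for 2x2 matrices W (within) and B (between).

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def roots_of_b_minus_lambda_w(b, w):
    """Roots of |B - lambda W| = 0 for 2x2 matrices, largest first."""
    # Expanding the determinant gives det(W) l^2 - c1 l + det(B) = 0.
    c2 = det2(w)
    c1 = (b[0][0] * w[1][1] + b[1][1] * w[0][0]
          - b[0][1] * w[1][0] - b[1][0] * w[0][1])
    c0 = det2(b)
    root = math.sqrt(max(c1 * c1 - 4.0 * c2 * c0, 0.0))
    return (c1 + root) / (2.0 * c2), (c1 - root) / (2.0 * c2)

def criteria(b, w):
    largest, smallest = roots_of_b_minus_lambda_w(b, w)
    return {
        "trace W": w[0][0] + w[1][1],        # to be minimized
        "det W": det2(w),                    # to be minimized
        "largest root": largest,             # to be maximized
        "sum of roots": largest + smallest,  # to be maximized
    }
```

For the diagonal matrices W = diag(2, 1) and B = diag(4, 3) the roots are 3 and 2, which can be confirmed by hand since each root is then a ratio of matching diagonal entries.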

Using R.A. Fisher's iris data, McRae tested his algorithm
and produced extremely good results. Using the det W
criterion and Mahalanobis distance, MIKCA produced a solution
identical to the classification given by multiple
discriminant analysis. This is a notable achievement since
the cluster procedure, which does not know the true composition
before the analysis, makes the same final classification
of observations as does the discriminant procedure, which
bases its analysis on the group composition information.

The MIKCA provides as output the value of the criterion
function, the cluster membership, and the cluster mean vectors.
Also provided are two matrices, T and W. The program was
written in FORTRAN IV for the IBM 360 series of computers.

B.   MODIFIED MIKCA

Initially, the MIKCA program was used with the p-component
mean responses for each course as the input data matrix.
Since the number of students utilized in producing these
means is quite variable, these input vectors are not equally
well determined and, as has been mentioned earlier, may
affect the covariance structure between the objects. It
is desirable to have the option of weighting these course
means in order to effect a better balance in terms of their
accuracy and to reduce any consequential distortion in the
covariances. It is convenient to refer to this modification
as the "1 man 1 vote" option, and to the original technique
as the "1 course 1 vote" option. The following algebraic
definitions will aid in illustrating the weighting effect.

Recall the breakdown of the master scatter matrix into
the sum of within and between matrices.

M   =   S  +  T

When the mean scores are computed for each course and used
as inputs to MIKCA, then a different dispersion structure
takes form. The groups are no longer the known courses,
but are now the object of the problem. The groups are unknown
clusters of courses (or professors). Let T* denote the
total scatter contained in the data when each observation
represents a course mean vector. T* may also be expressed
as the sum of within and between scatter matrices.

T*   =   W*  +  B*

These matrices are defined as follows:

$$T^* = \sum_{s=1}^{g} \sum_{k=1}^{nc_s} (x_{sk} - \bar{x}_{\cdot\cdot})(x_{sk} - \bar{x}_{\cdot\cdot})'$$

$$W^* = \sum_{s=1}^{g} \sum_{k=1}^{nc_s} (x_{sk} - \bar{x}_{s\cdot})(x_{sk} - \bar{x}_{s\cdot})'$$

$$B^* = \sum_{s=1}^{g} nc_s \, (\bar{x}_{s\cdot} - \bar{x}_{\cdot\cdot})(\bar{x}_{s\cdot} - \bar{x}_{\cdot\cdot})'$$

where

$nc_s$ is the number of observations (courses) in the s-th cluster.

$g$ is the number of clusters.

$x_{sk}$ is the k-th observation (course mean vector) in the s-th cluster.

$\bar{x}_{s\cdot}$ is the mean vector of the s-th cluster,
$\bar{x}_{s\cdot} = \frac{1}{nc_s} \sum_{k=1}^{nc_s} x_{sk}$.

$\bar{x}_{\cdot\cdot}$ is the grand mean,
$\bar{x}_{\cdot\cdot} = \frac{\sum_{s} \sum_{k} x_{sk}}{\sum_{s} nc_s}$.

Note that the grand mean mentioned here is not the same as
the grand mean used in the decomposition of the master
matrix M. The difference between T and T* is that T is
weighted by the number of students in each course, $ns_i$. This
weighting factor was lost when the individual observations
were viewed as the class mean vectors. A close algebraic
examination of T will illustrate its weighted property.
Originally, we had M = S + T where:

$$T = \sum_{i=1}^{N} ns_i \, (\bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot})(\bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot})'$$

It is now helpful to show the decomposition of T.

T   =   W  +  B

Let $\bar{X}_{i\cdot}$ become $x_{sk}$ (k-th course mean in s-th cluster) and
$ns_i$ become $ns_{sk}$ (number of students in k-th course of s-th
cluster). Therefore the same T can be reexpressed as

$$T = \sum_{s=1}^{g} \sum_{k=1}^{nc_s} ns_{sk} \, (x_{sk} - \bar{X}_{\cdot\cdot})(x_{sk} - \bar{X}_{\cdot\cdot})'$$

Letting

$$W_s = \sum_{k=1}^{nc_s} ns_{sk} \, (x_{sk} - \bar{x}_s)(x_{sk} - \bar{x}_s)'$$

where

$$\bar{x}_s = \frac{\sum_{k=1}^{nc_s} ns_{sk} \, x_{sk}}{\sum_{k=1}^{nc_s} ns_{sk}} \qquad \text{(weighted mean vector of s-th cluster)}$$

then

$$W = \sum_{s=1}^{g} W_s$$

and

T  =  W  +  B      (B is obtained by subtraction)

The  understanding  of  this  distinction  is  important  because 
it  describes  the  abbreviated  (unweighted)  dispersion  upon 
which  MIKCA  bases  its  cluster  solution. 

A number of changes were made to the MIKCA computer code
to allow for a system of weights, $ns_i$, for the course means.
The modified code extends the capability of MIKCA by making
this option available. It amounts to using T rather than
T* as the basic dispersion structure. This seems more natural
because the matrix T appears in the earlier decomposition.

M   =   S  +  T   =   S  +  W  +  B

Some of the changes are summarized here:

1.  Allow  for  class  size  as  input. 

2.  Alter  the  computation  of  T  to  allow  for  the  weighting 
factor. 

3.  Alter  the  computation  of  cluster  centroids  to  allow 
for  weighting. 

4.  Alter  calculations  of  the  B  matrix  for  the  same 
reason.   (W  is  found  by  subtracting  B  from  T.) 

The computer code for the modified MIKCA is included in
Appendix F.
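The effect of the weighting can be sketched independently of the FORTRAN code in Appendix F. The example below is an illustration only: the course means, class sizes, and cluster assignment are made-up values, the function names are hypothetical, and B is computed here from the weighted cluster centroids (an assumption, since the text obtains it by subtraction) so that the identity T = W + B can be checked.

```python
# Sketch: class-size-weighted scatter matrices T, W and B for clustered
# course means. All values are assumed illustration data, not SOF data.

def weighted_mean(vectors, weights):
    total = float(sum(weights))
    return [sum(w * v[d] for v, w in zip(vectors, weights)) / total
            for d in range(len(vectors[0]))]

def weighted_scatter(vectors, weights, center):
    """Sum of w * (v - center)(v - center)' over all vectors."""
    p = len(center)
    out = [[0.0] * p for _ in range(p)]
    for v, w in zip(vectors, weights):
        dev = [a - b for a, b in zip(v, center)]
        for r in range(p):
            for c in range(p):
                out[r][c] += w * dev[r] * dev[c]
    return out

def decompose(clusters):
    """clusters: list of clusters, each a list of (course_mean, class_size)."""
    all_means = [m for cl in clusters for m, _ in cl]
    all_sizes = [n for cl in clusters for _, n in cl]
    grand = weighted_mean(all_means, all_sizes)   # weighted grand mean
    T = weighted_scatter(all_means, all_sizes, grand)
    p = len(grand)
    W = [[0.0] * p for _ in range(p)]
    centroids, totals = [], []
    for cl in clusters:
        means = [m for m, _ in cl]
        sizes = [n for _, n in cl]
        centroid = weighted_mean(means, sizes)    # weighted cluster centroid
        W_s = weighted_scatter(means, sizes, centroid)
        W = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(W, W_s)]
        centroids.append(centroid)
        totals.append(sum(sizes))
    B = weighted_scatter(centroids, totals, grand)
    return T, W, B

clusters = [
    [([4.6, 4.4], 5), ([4.2, 4.5], 8)],
    [([3.1, 3.9], 25), ([2.8, 3.4], 31), ([3.5, 3.6], 18)],
]
T, W, B = decompose(clusters)
```

Setting every class size to one in this sketch gives the unweighted matrices T*, W* and B* of the original program.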

Cluster solutions using both weighted (T) and unweighted
(T*) dispersion structures were found and compared (see
table 3 in next section). The comparison indicates some
differences in cluster solutions; however, the importance
of these differences is left to the reader.

C.   RESULTS 

Several cluster solutions were formed using the MIKCA
algorithm. It seemed wise to include the number of students
in a course as the 14-th variable. The natural logarithm
of the class size was the transformation applied to this
variable. Since class sizes ranged from two to 40, this
transformation brought the values into a range similar to that
of the other 13 variables and also reduced skewness. For quarter
773 the mean class size was 12.7 students with a standard
deviation of 7.9. For the transformed variable these
statistics are 2.3 and 0.7. Cluster solutions were found
with and without inclusion of this 14-th variable. The
results are shown in table three.

Another option available to the MIKCA user is the standardization
of variables prior to entering the clustering
process. McRae points out how this option becomes very
useful when the variables are on vastly different scales of
measurement. Except for the 14-th variable the present
scales are psychological in nature and seem to be much the
same. Some exploratory work was performed with the standardization
option (see table three) but it was not considered
significant because of the similarity in the scales of
measurement.

Table three shows the comparisons of cluster results
obtained under various conditions. The comparison coefficient
provides a measure of agreement between solutions and
is computed by a method introduced in Chapter VII. Table
three shows generally higher values for g = 3, indicating
that there exists robustness of solutions for the smaller
values of g.

The results of these cluster solutions may also be seen
in graphical form by referring to Appendix B. These graphs,
called profile charts, depict the mean vectors for each of
the clusters formed by the MIKCA algorithm. The mean vectors
have been standardized so that one can see the number of
standard deviations from the grand mean. These profiles are
also helpful in identifying the variables which are significant
in the cluster membership. For example, an important
variable would be one that produces a break in the pattern.

In  the  13  variable  case,  the  profiles  produced  results 
which  indicated  the  lack  of  clearly  dominant  variables  in 
cluster  identification.   With  introduction  of  the  14-th 
variable,  some  very  revealing  results  become  immediately 
apparent.   While  the  cluster  membership  changed  little  in 
going  from  13  to  14  variables,  the  cluster  with  the  highest 
mean  vector  became  clearly  associated  with  the  smallest 
class  sizes.   Similarly  the  cluster  with  the  lowest  mean 
vector  is  characterized  by  a  very  large  class  size.   This 
finding  is  one  of  the  most  significant  results. 

One of the most critical decisions facing the analyst
is the number of clusters to form. Some algorithms based
on the K-MEANS idea allow g to change during the clustering
process; however, the MIKCA method requires g to be input by
the user, and it does not change in the course of the program
execution. Typically the investigator does not know
the number of clusters in the data, and he must make some
educated guess. As pointed out earlier, it is possible for
several different, but meaningful, cluster solutions to
exist in one body of data.

The method used to determine g was to obtain solutions
based upon several values for g and then plot the criterion
values for each of these solutions. An appropriate choice
for g would be a number beyond which the marginal improvement
of the criterion becomes insignificant. Figures 6 and 7
are the results of such tests suggesting that six clusters
represent the major portion of the separating power of the
algorithm.
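The rule of thumb above can be written as a small helper. This is a sketch, not the procedure used to produce figures 6 and 7: it assumes a criterion that is minimized and positive, and the 10 percent threshold for an "insignificant" improvement is an arbitrary assumption.

```python
# Sketch: pick g where the marginal improvement of the criterion levels off.
# The tolerance is an assumed threshold, not a value from the thesis.

def choose_g(criterion_by_g, tolerance=0.10):
    """Return the smallest g whose next step improves the criterion
    by less than `tolerance` (as a fraction of the current value)."""
    gs = sorted(criterion_by_g)
    for g, next_g in zip(gs, gs[1:]):
        gain = (criterion_by_g[g] - criterion_by_g[next_g]) / criterion_by_g[g]
        if gain < tolerance:
            return g
    return gs[-1]  # every step still gave a worthwhile improvement
```

In practice the analyst would still inspect the plot and the profile charts, since a numerical threshold cannot tell whether the resulting clusters are meaningful.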

Profile charts of the cluster solutions with g = 6 were
uninteresting. The middle clusters were all bunched
together, suggesting that clusters were forced on that part
of the data where perhaps they did not actually exist (i.e.,
sparse data near the boundaries). Comparison results (table
3) indicate a much more stable solution when g is reduced
below six.

V.   DISCRIMINANT ANALYSIS

A.   THEORY 

As  mentioned  earlier,  discriminant  analysis  allows  an 
analyst  to  classify  new  observations  based  on  observations 
which  are  samples  from  known  groups.   Only  Fisher's  linear 
discriminant  function  will  be  used  in  this  study.   It  also 
provides  information  about  the  relative  importance  of  the 
various  variables  in  assigning  an  observation  to  a  group. 
The  linear  discriminant  function  is  based  on  the  assumptions 
of  multivariate  normality  and  homogeneity  of  dispersions. 
The  ability  to  identify  the  dominant  variables  and  the 
dimension  reduction  offered  by  the  discriminant  space  were 
both  extremely  useful  aids  for  analyzing  the  SOF  data. 

These "more important" variables will be earmarked for
later use in the construction of Chernoff's FACES. Also
of interest is the plot of data points in discriminant
space. The interaction of the coefficients in the discriminant
functions will be seen as well as the characterization
of the dimensions.

In order to describe our usage of discriminant analysis,
let us first suppose there are only two clusters in 13-dimensional
space. It is desired to project these two
clusters orthogonally onto a line so that the variation
between the two groups is as large as possible relative to
the variation within the two groups. Finding the direction
of projection to accomplish this is part of the purpose of
discriminant analysis. The solution provides a way of
discriminating between the two clusters by a suitable linear
combination of the 13 variables. The same theory is generalized
to g groups, where Wilks [9] has shown that a
projection to the smaller of g-1 or p dimensions is possible
without loss of information. Recall the earlier discussion
that indicated this smaller number as t, the number of non-zero
eigenvalues of $W^{-1}B$. The eigenvalues are the variances
in the direction of their associated eigenvectors. One can
easily determine the proportion of variance attributable
to each of the dimensions of discriminant space and also
the SOF items which load most heavily in each dimension.

One gains insight into the variables from examination of
the coefficients in the discriminant functions. There is
one function for each dimension, the standardized coefficients
of which are used in this analysis.
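For the two-group case described above, the direction of projection has a closed form: it is proportional to the inverse of the pooled within-group scatter applied to the difference of the group means. The sketch below illustrates this for two variables; the group data are made-up values, the function names are hypothetical, and the coefficients are not standardized as they are in the thesis.

```python
# Sketch: Fisher's two-group discriminant direction in two dimensions.
# The observations are assumed illustration values, not SOF data.

def mean(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def within_scatter(groups):
    """Pooled within-group scatter matrix W (2x2)."""
    w = [[0.0, 0.0], [0.0, 0.0]]
    for rows in groups:
        m = mean(rows)
        for x in rows:
            dev = [x[0] - m[0], x[1] - m[1]]
            for r in range(2):
                for c in range(2):
                    w[r][c] += dev[r] * dev[c]
    return w

def fisher_direction(g1, g2):
    """Coefficients a = W^{-1} (mean1 - mean2)."""
    (a, b), (c, e) = within_scatter([g1, g2])
    d = [u - v for u, v in zip(mean(g1), mean(g2))]
    det = a * e - b * c
    return [(e * d[0] - b * d[1]) / det, (-c * d[0] + a * d[1]) / det]

def separation(direction, g1, g2):
    """Between-group relative to within-group variation after projection."""
    p1 = [sum(w * v for w, v in zip(direction, x)) for x in g1]
    p2 = [sum(w * v for w, v in zip(direction, x)) for x in g2]
    m1, m2 = sum(p1) / len(p1), sum(p2) / len(p2)
    within = sum((v - m1) ** 2 for v in p1) + sum((v - m2) ** 2 for v in p2)
    return (m1 - m2) ** 2 / within

g1 = [[4.6, 4.4], [4.2, 4.5], [4.8, 4.1], [4.4, 4.0]]
g2 = [[3.1, 3.9], [2.8, 3.4], [3.5, 3.6], [3.0, 3.2]]
a = fisher_direction(g1, g2)
```

No other direction gives a larger value of `separation` for these two groups, which is the sense in which the projection is the best one.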

B.   RESULTS

Up to this point, most of the analysis has been performed
on the 190 courses in quarter 773. A smaller, more manageable
data base was needed to continue. Also, it seemed
wise to prepare to study individual departments. The
Electrical Engineering Department was chosen for further
analysis since it is large enough for this purpose. Over
the four quarter period, there were 116 course segments with
valid SOF responses. These 116 courses from the EE department
were the data used in the discriminant analysis.

When dealing with four clusters, the dimensionality of
the discriminant space is three, and depending on the size
of the eigenvalues, perhaps fewer dimensions will provide
sufficient discrimination. Table four gives the results of
performing a discriminant analysis. Figure eight is a graph
of the two-dimensional discriminant space (the third
dimension is neglected).

The eigenvalues indicate that 94 percent of the total variance
is represented by the first two discriminant functions.
Figure eight corroborates this fact by depicting easily
seen separation in two dimensions. Imagine projecting the
points onto the horizontal axis. Discrimination in the
first dimension would account for 73.6 percent of the
variation. Groups one and four would easily be separated,
but two and three would overlap.
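The relative percentages follow directly from the three eigenvalues reported in table four: each eigenvalue is divided by their sum. A minimal sketch of that arithmetic:

```python
# Relative percentage of each discriminant function, from the
# eigenvalues reported in table four.
eigenvalues = [5.79, 1.64, 0.44]
total = sum(eigenvalues)

relative = [round(100.0 * ev / total, 1) for ev in eigenvalues]
first_two = round(100.0 * (eigenvalues[0] + eigenvalues[1]) / total, 1)
```

Here `relative` reproduces the percentage column of table four, and `first_two` is the share carried by the two-dimensional plot.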

Examination  of  the  coefficients  will  enable  one  to  label 
the  dimensions  by  identifying  the  dominant  characteristics 
which  they  measure.   The  first  dimension  is  along  the  hori- 
zontal axis  and  is  associated  with  the  first  discriminant 
function.   The  magnitude  of  the  coefficients  indicates  their 
relative  impact  on  the  dimension.   The  signs  aid  in  under- 
standing which  variables  reinforce  one  another  (matching 
signs)  and  which  tend  to  cancel  (opposite  signs) .   In  the 
first  function  of  table  four,  SOF  item  12  is  the  most  promin- 
ent.  This  question  (see  figure  2)  asks  the  student  to  score 
the  overall  rating  of  the  instructor.   It  is  not  surprising 

RESULTS OF DISCRIMINANT ANALYSIS ON THE 116 COURSES
IN EE DEPARTMENT OVER A FOUR QUARTER PERIOD

DISCRIMINANT                     RELATIVE
FUNCTION         EIGENVALUE      PERCENTAGE
    1               5.79            73.6
    2               1.64            20.8
    3               0.44             5.6

STANDARDIZED DISCRIMINANT FUNCTION COEFFICIENTS

SOF ITEM     Function 1     Function 2     Function 3
    1          -0.11           0.23          -0.22
    2           0.13           0.14          -0.09
    3          -0.05           1.46           0.01
    4          -0.47          -0.36          -0.80
    5          -0.31           0.01          -0.82
    6           0.08           0.36          -0.74
    7           0.15          -0.36           0.12
    8           0.23           1.15          -0.36
    9           0.36          -0.82          -0.47
   10           0.05           0.08          -0.77
   11          -0.20          -1.16           1.18
   12          -0.72          -1.08           1.80
   13          -0.18           0.91

TABLE 4

that this question is very important in the discrimination
process. High marks on items four and five tend to
reinforce a high mark on question 12. Those questions are:

(4)  Difficult  concepts  were  made  understandable. 

(5)  I  had  confidence  in  the  instructor's  knowledge 
of  the  subject. 

Interestingly,  a  high  mark  on  item  nine  (instructor  made 
the  course  a  worthwhile  learning  experience)  tends  to 
diminish  the  effect  of  a  high  score  on  item  12.   The  first 
dimension  is  dominated  by  question  12  and  was  labeled  the 
"popularity"  dimension. 

The second dimension is depicted by the vertical axis
on the graph in figure eight, and measurements along this
dimension are controlled by the second discriminant function.
The separating power in this direction is less than one
third that of the first. Note, however, that the vertical
scale is compressed 25 percent relative to the horizontal
scale (1.5 inches vertical = 2.0 inches horizontal). Items
three and eight have strong positive coefficients, whereas
questions 11 and 12 pull heavily in the negative
direction. However, the strength of the information is not
great, and deeper interpretation hardly seems worth the
effort.

Only 5.6 percent of the total variance appears in the
third function, and it is therefore considered insignificant.
One might note that item 12 also dominates the third
dimension.
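
The reading of the coefficients described above (largest
magnitudes first, then the signs) can be sketched in a few lines
of Python. The sketch is offered only as an illustration; the
dictionary layout and the function name dominant_items are
assumptions, and the third coefficient of item 13 is entered as
None because it is not legible in table four.

```python
# Standardized discriminant function coefficients from table four,
# keyed by SOF item number: (function 1, function 2, function 3).
coefficients = {
    1: (-0.11, 0.23, -0.22),
    2: (0.13, 0.14, -0.09),
    3: (-0.05, 1.46, 0.01),
    4: (-0.47, -0.36, -0.80),
    5: (-0.31, 0.01, -0.82),
    6: (0.08, 0.36, -0.74),
    7: (0.15, -0.36, 0.12),
    8: (0.23, 1.15, -0.36),
    9: (0.36, -0.82, -0.47),
    10: (0.05, 0.08, -0.77),
    11: (-0.20, -1.16, 1.18),
    12: (-0.72, -1.08, 1.80),
    13: (-0.18, 0.91, None),  # third value not legible in the table
}


def dominant_items(function_index, top=4):
    """Return the `top` items with the largest absolute coefficient."""
    usable = [item for item, c in coefficients.items()
              if c[function_index] is not None]
    ranked = sorted(usable,
                    key=lambda item: abs(coefficients[item][function_index]),
                    reverse=True)
    return [(item, coefficients[item][function_index]) for item in ranked[:top]]


for index in range(3):
    print(f"function {index + 1}:", dominant_items(index))
```

For the first function the ranking begins with items 12, 4, 9,
and 5, and for the second with items 3, 11, 8, and 12, which is
the ordering relied upon in the discussion.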

The main purpose here has been to identify variables
for use in constructing Chernoff FACES. The discriminant
analysis has served that purpose well, and it has also
described the character of the dimensions.

VI.   CHERNOFF FACES

A.   BACKGROUND 

Chernoff's FACES was the second cluster method to be
applied to the SOF data. The method was used with the
same purpose in mind, and it was hoped that earlier cluster
solutions could be reproduced by this method. Additionally,
there was the possibility of gaining new information about
the structure within the data. Professor Herman Chernoff
developed this graphical method for representing multivariate
data. The now familiar data point in p-space is represented
by a computer-drawn cartoon of a face whose characteristics
(features) are determined by the position of the point.
Features such as nose length and mouth curvature correspond
to components of the data point. In the case of the SOF
data, each component of the 13-dimensional vector can be
made to control one of 20 features, and seven constants can
be selected for the remaining features. The technique lends
itself to clustering since the investigator can group
together those faces which resemble each other.

Chernoff [10] points out that people spend a great deal
of their lives studying and reacting to faces. The human
mind subconsciously acts as a high speed computer, sometimes
detecting barely measurable differences and ignoring
unimportant differences, even if they are large. Chernoff
claims that, unlike a machine, the mind has the capability
to disregard non-informative data and search for useful
information. He states that certain major characteristics
of the faces are instantly observed and easily remembered;
finer details and correlations become apparent after studying
the faces. Clustering by sorting the faces is certainly
easier than staring at a large matrix of data. The method
has pitfalls and limitations, and some of them will be dealt
with in this thesis.

After the publication of Chernoff's method [11], quite
a number of people began experimenting with the technique.
Lake [12] mentions a few of the more successful applications
of Chernoff's method, including:

1.  L.A. Bruckner of Los Alamos Scientific Lab of the
University of California, in studying the performance
of offshore oil companies.

2.  Johns Hopkins University

a.  Developing methods of psychiatric screening.

b.  Monitoring patients in intensive care units.

c.  Monitoring the stock market.

3.  Dr. David L. Huff of the University of Texas, in
developing urban regional indicators that measure the
quality of life.

4.  Professor P.C.C. Wang and Gerald Lake at the Naval
Postgraduate School, in analyzing Soviet naval
penetrations into the Indian Ocean and the African littoral,
and Soviet foreign policy in sub-Saharan African
states.

5.  Professor Chernoff, in geological and fossil-related
experiments.

The field of computer graphics has experienced tremendous
growth in recent years, due mainly to advances in the state
of the art in computers and computer display equipment
(including both video and plotting types). The adage that
"a picture is worth a thousand words" has proven to be quite
true. Recent developments include on-line programs that
perform statistical analysis with polygons, bar graphs,
arrows, and scatter diagrams. Three-dimensional data displays
have facilitated the work of engineers and statisticians alike.

An interesting application of the FACES program is
Bruckner's study of offshore drilling by oil companies.
Figure 9 displays some of his results. Two of the features,
nose width and eye separation, are controlled by the
variables "expected years to production" and "number of leases
won", respectively. Other features are controlled by a
variety of variables representing the company's financial
health and growth potential.

Reference  to  figure  10  will  help  describe  how  the  faces 
are  constructed.   Table  5  gives  the  range  of  the  variables 
which  control  the  features  and  distance  parameters  of  the 
face. 

The data are first converted to the X parameters as
follows. The variable Z is used to control the parameter
$X_i$, which is allowed to range from $a_i$ to $b_i$.

Figure 9. Bruckner's Offshore Hydrocarbon Producers

Figure 10. Chernoff Face with Ears

FEATURE RANGES AND DESCRIPTION

This table is taken from reference 10, and the descriptions
are not complete. For a more detailed, mathematical
explanation, see Appendix C.

Range         Parameter         Description
(0,1)         X1  controls      distance from O to P
(0,1)         X2  controls      angle between OP and horizontal
(0,1)         X3  controls      half-height of face
(0.5,2)       X4  is            eccentricity of upper ellipse
                                of face (width/height)
(0.5,2)       X5  is            eccentricity of lower ellipse
                                of face (width/height)
(0,1)         X6  controls      length of nose
(0,1)         X7  controls      position of center of mouth
(-5,5)        X8  controls      curvature of mouth (radius = h/X8)
(0,1)         X9  controls      length of mouth
(0,1)         X10 controls      height of centers of eyes
(0,1)         X11 controls      separation of centers of eyes
(0,1)         X12 controls      slant of eyes
(0.4,0.8)     X13 is            eccentricity of eyes (height/width)
(0,1)         X14 controls      half-length of eye (this length also
                                depends in part on X10 and X11)
(0,1)         X15 controls      position of pupils
(0,1)         X16 controls      height of eyebrow center relative
                                to eye
(0,1)         X17 controls      angle of brow relative to eye
(0,1)         X18 controls      length of brow
(0,1)         X19 is            ear diameter
(0,1)         X20 is            nose width

TABLE 5

$$X_i = a_i + (b_i - a_i)\,\frac{Z - m}{M - m}$$

where m and M are the observed minimum and maximum of Z.
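
The conversion is a simple linear rescaling, and a minimal Python
sketch of it is given below for illustration. The function name
scale_to_feature and the sample scores are assumptions made for
the sketch; the scores are not taken from the SOF data. The range
(-5, 5) is the one listed for the mouth curvature in table 5.

```python
def scale_to_feature(z, z_min, z_max, a, b):
    """Map a data value z in [z_min, z_max] linearly onto the range [a, b]."""
    if z_max == z_min:
        raise ValueError("the variable has no spread; the scaling is undefined")
    return a + (b - a) * (z - z_min) / (z_max - z_min)


# Hypothetical mean scores on one SOF item (not thesis data).
scores = [2.1, 3.4, 4.0, 4.8]
m, M = min(scores), max(scores)

# Mouth curvature is allowed to range over (-5, 5).
curvatures = [scale_to_feature(z, m, M, -5.0, 5.0) for z in scores]
print(curvatures)
```

The smallest observed value is mapped to the lower end of the
range and the largest to the upper end, so a single extreme
observation stretches the scale for every other face.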

Chernoff's technical report [10] presents a very detailed
description of the geometric relationship of the features
in the face construction. A few general remarks concerning
the geometric attributes are included here. The boundary of
the face is formed by joining portions of two ellipses, an
upper and a lower. The angle theta (θ) determines where
the ellipses meet and, consequently, the height of the ears.
The nose is a triangle centered at the origin. Both its
height and width are variable. The curvature of the mouth
is a portion of a circle, the radius and center of which are
also variable. The eyes are formed by ellipses whose angle,
half-length, and eccentricity are all controlled by variables.

B.   FEATURE-VARIABLE  RELATIONSHIP 

A  frequent  question  is  whether  some  features  are  more 
informative  than  others.   Some  observers  feel  that  the  eyes 
convey  the  most  information;  others  regard  the  mouth  or  the 
shape  of  the  face  as  the  most  relevant  feature.   The  results 
of  the  discriminant  analysis  identified  the  most  dominant 
variables  in  the  discriminant  space.   Now  these  variables 
must  be  assigned  to  facial  features. 

Chernoff [13] himself conducted an experiment to evaluate
the effect on classification error of random permutations
in the assignment of variables to features. He found that
random permutations would change the faces so that a
classifier might increase or decrease his number of errors
by a factor of about 25 percent. Unfortunately, his
experiment did not evaluate the efficiency of specific features.
His studies also make no effort to determine whether ability
to discriminate depends on the dimensionality of the data.

Considering Chernoff's findings, it would seem that
the assignment of variables to features is of minor
importance. Nevertheless, the use of discriminant analysis
provides a way of detecting which variables are important,
and it seems appropriate to take advantage of this valuable
information when constructing the faces. Moreover, there is
a choice in the features that are selected for use. The
six features which the author considers best are starred in
table 6. The table gives the complete list of feature-
variable combinations. The results of the discriminant
analysis were relied upon heavily in forming the variable
assignments.

Reference to figure eight (discriminant space) and
table 6 will aid in the following discussion. In the first
dimension the important SOF items are 12 and 4, which
control the mouth curvature and ear height, respectively.
High scores on these two items separate the observation
well to the negative end of the scale and cause the face
to have a big smile and high ears. Items 12 and 4 have the
same sign (negative), but item 9 is associated with a large
positive coefficient and controls the lower eccentricity of

FEATURE-VARIABLE COMBINATIONS

                                  THREE DIFFERENT TRIALS
                                      CONTROLLED BY
       FEATURE                  13 VAR     6 VAR     8 VAR
   1   FACE WIDTH                 0.5       0.5       0.5
   2   ANGLE θ                    4         0.65      4
   3   FACE HEIGHT                0.7       0.7       0.7
   4   UPPER ECCENTRICITY         8         0.95      0.95
  *5   LOWER ECCENTRICITY         9         4         0.6
   6   NOSE LENGTH                10        0.45      9
   7   MOUTH CENTER               0.5       0.3       0.5
  *8   MOUTH CURVATURE            12        12        12
   9   MOUTH LENGTH               13        0.7       0.8
  10   EYE HEIGHT                 0.23      0.23      0.23
  11   EYE SEPARATION             1         0.5       0.5
 *12   EYE SLANT                  11        3         3
  13   EYE ECCENTRICITY           3         0.6       0.6
  14   EYE HALF LENGTH            6         0.5       5
 *15   PUPIL POSITION             2         9         13
  16   EYEBROW HEIGHT             0.3       0.3       0.3
 *17   EYEBROW ANGLE              5         8         8
  18   EYEBROW LENGTH             0.4       0.4       0.4
  19   EAR DIAMETER               0.3       0.3       0.3
 *20   NOSE WIDTH                 7         11        11

Integer numbers are the SOF item numbers.
Decimal values are the fixed features.

TABLE 6

the face. A low mark on this item would complement high
marks on items 12 and 4 and would be reflected in the lower
face having small eccentricity (narrower).

Turning to the vertical axis (second dimension) of
figure eight, which carries 21 percent of the total variance,
the dominant variables are 3, 8, and 11; the coefficient of
item 11 is negative, and those of items 3 and 8 are positive.
High scores on items 3 and 8 separate the observation upward
on the vertical axis and are reflected as highly eccentric
eyes and upper ellipse.

Droopy eyes, reflecting a small value for SOF item 11,
tend to complement and reinforce the higher values for
items three and eight. It seems like a good idea to use
the results of the discriminant analysis in this way, but
it is impossible for the viewer to know which variables act
together and which interfere unless he is told beforehand.

A good deal of exploratory work was carried out to
determine useful ranges for the features. The more the
features are allowed to vary, the wider the variety of
faces produced. With large ranges, however, faces formed
from extreme data can become very distorted. At the other
extreme, too little variability in the ranges suppresses
valuable information and hinders the clustering process.
It appears that the best ranges depend on the structure
and amount of variability in the data. Every data set has
its own characteristics, and it is best to tailor each to
its own best set of ranges. A great portion of the SOF
data is found "close" to the grand mean, but with a few
significant outliers. In order to provide discriminating
ability within the largest mass of the data, the appearance
of the outliers had to be further accentuated. The ranges
were set at values which would allow close-in discrimination
while attempting to minimize the departure of the outliers.

C.   CLUSTERING  THE  FACES 

The next step in this research was to cluster the faces.
This task was performed by six students in the Operations
Research curriculum. The faces, shown in figure 11, represent
the 33 course segments from the Electrical Engineering
department in quarter 781. The judges (students performing the
clustering) were given no information concerning the feature-
variable combination. They were simply instructed to group
the faces in the manner which best suited them. Fifteen
minutes were allowed for the task. The purpose was to
cluster the faces quickly, but carefully. The judges
were reminded that each face is different and were asked to
search for the most natural looking clusters. It was felt
that too much time spent on this task could defeat the
purpose of the faces as a first pass look at the data. In
every case, the judges acted independently of one another.
No clues were provided which might have indicated which
features were more important.

Figure 11 shows the faces in the clusters which were
formed by the MIKCA algorithm. This cluster structure was
used as a standard against which the judges' results were
compared. Table 7 shows the results of this experiment.
There was considerable agreement among the judges, as
indicated by the comparison coefficients. There was also
a good deal of similarity between the clusters formed by
the judges and those formed by the MIKCA algorithm.

Figure 11. Clusters Determined by MIKCA

Several comments by the judges indicated the
difficulties they encountered. The most prevalent comment was the
difficulty in deciding which feature to consider the most
important. One judge considered the mouth first in every
case, while another judge used the slant of the eyes as
a more important variable. The judges also indicated that
trying to evaluate differences in many features
simultaneously was quite difficult. It is interesting to note
that the judges' results were quite similar despite the
fact that different criteria were employed as they formed
the clusters.

The SOF identification numbers have been altered for
this report. There were two course segments which
erroneously reported the same SOF number (see face 150). As one
looks at the faces with the discriminant space in mind, it
is much easier to form a clustering which is similar to the
MIKCA solution. One would be aware, for example, that the
position of the pupils is critical in that it can diminish
the effect of the smile and impact heavily on the horizontal
dimension. This effect can be seen by referring to faces
139 and 140; they are included in a group whose smiles are

TABLE 7

COMPARISON COEFFICIENTS FOR ALL PAIRS OF JUDGES

JUDGE      1      2      3      4      5      6
  1       1.0    .69    .73    .82    .73    .78
  2              1.0    .90    .80    .77    .68
  3                     1.0    .65    .81    .67
  4                            1.0    .68    .73
  5                                   1.0    .69
  6                                          1.0

COMPARISON COEFFICIENTS BETWEEN EACH JUDGE AND MIKCA

JUDGE     COMPARISON COEFFICIENT
  1               .59
  2               .58
  3               .68
  4               .81
  5               .76
  6               .73

SIMULTANEOUS COMPARISONS OF MULTIPLE JUDGES

NUMBER OF JUDGES     COMPARISON COEFFICIENT
       3                      .52
       4                      .44
       5                      .38

not as large as their own because the position of their
pupils (positive to the right) has diminished the impact
of the curvature of the mouth.

Another example of the interaction of variables is
seen by referring to face 152. Most judges would quickly
include this face with the group of four at the top of
figure 11. A subtle difference, however, is the ear
position. Reference to the discriminant function
coefficients will indicate a negative coefficient which has
offset the slant of the eyes in the second dimension.

Knowledge of the discriminant functions helps alleviate
the confusion which sets in when attempting to cluster. This
is especially true in this case, where so little difference
exists between the majority of the faces in the middle groups.

Difficulty in evaluating all 13 features simultaneously
was a problem. As an alternative to this set of faces, two
other sets were produced, one with only six variable
features and the other with eight. Figure 12 contains samples
from these sets: 12a the six variable set and 12b the eight
variable set. Of course, not all of the data are represented
in this manner. Only those variables which loaded heavily in
the discriminant analysis were used, and the features
controlled by those variables are the ones considered to convey
the most information. Table 6 gives the complete feature-
variable combinations used in the construction of all sets
of faces. The data used in constructing the set of 33 faces
are found in Appendix B.

D.   PROBLEMS ENCOUNTERED

The last section addressed difficulties faced by the
judges because it was impossible for them to be aware of
the information contained in the discriminant analysis.
Of course, it could be pointed out that there is little
reason to use this particular MIKCA solution as the
standard, but it does serve as an objective standard, as it
was desired to compare the machine results with the human
results. This section addresses problems of a more
mechanical nature.

Exploratory work with the faces uncovered quite a
number of relationships between the features. The
existence of geometric dependencies (not discriminant-type
effects) between features caused difficulties in clearly
displaying the variables. Two notable examples are
mentioned here.

The length of the mouth is quite dependent upon its
curvature. The projection on the horizontal axis (no
relation to discriminant axis) has half-length

$$a_m = X_9 \left( h / |X_8| \right)$$

where X8 is the mouth curvature. The variables which
control these features are thus automatically forced into this
dependent status regardless of their true relationship.

The other example concerns the ellipses forming the
facial boundary and the angle theta. The upper ellipse is
drawn through the points P', U, and P; the lower through
P', L, and P (see figure 10). Two faces with identical
values for the ellipses might have quite different appearing
facial boundaries due to the dependence on theta for the
points P' and P. This is another example of forced dependence.

In order for the width and height of the face to meet a
specified constant, the program "normalizes" both horizontal
and vertical axes. This normalization eliminates the
effects of X1 and X3, and it adjusts all of the features
during the process. It is believed to be this normalization
process which causes faces which are growing wider and wider
to suddenly revert to one-half the widest width when the
width exceeds a threshold value. A similar phenomenon is
experienced in the height variable. This half-size
adjustment may be seen in figure 12b. Face 132 has been changed
by a disproportionate amount due to the normalization
process. It is of interest to point out that the face-width
feature was being held constant during the construction of
that set of faces.

Yet another hidden dependency is that of the eye height
on the nose length. The eyes are located at height

$$y_e = h\,[X_{10} + (1 - X_{10})\,X_6]$$

where X6 controls the length of the nose.
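
The two formulas quoted in this section make the forced
dependencies easy to demonstrate numerically. The Python sketch
below is an illustration only: the function names and the
parameter values are assumptions, and the half-height h is set to
1.0 purely for convenience. Appendix C remains the authority for
the formulas actually used by the program.

```python
def mouth_half_length(x9, x8, h):
    """Half-length of the mouth's horizontal projection: X9 * (h / |X8|)."""
    if x8 == 0:
        raise ValueError("X8 = 0 gives an unbounded radius of curvature")
    return x9 * (h / abs(x8))


def eye_height(x10, x6, h):
    """Height of the eye centers: h * [X10 + (1 - X10) * X6]."""
    return h * (x10 + (1.0 - x10) * x6)


h = 1.0  # hypothetical half-height of the face

# Same mouth-length parameter X9, two different curvatures X8.
flat_mouth = mouth_half_length(0.5, 1.0, h)
curved_mouth = mouth_half_length(0.5, 5.0, h)

# Same eye-height parameter X10, two different nose lengths X6.
short_nose_eyes = eye_height(0.23, 0.2, h)
long_nose_eyes = eye_height(0.23, 0.8, h)

print(flat_mouth, curved_mouth)
print(short_nose_eyes, long_nose_eyes)
```

In both cases the drawn feature changes although the parameter
that nominally controls it is held fixed, which is the kind of
dependence that can mislead a viewer.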

These and other subtle dependencies mislead the
investigator if he is not aware of their existence. These problems
reduce the ability of the faces to effectively display the
full 20 dimensions. Unfortunately, these points are not
explicit in the original document [11], and their discovery
was an eleventh-hour surprise. It was not possible to adjust
for them or to uncover all such relationships at this
writing. Appendix C gives a complete listing of the formulas
used in the construction of the faces.

VII.   COMPARISON COEFFICIENT

A.   BACKGROUND  AND  ALTERNATIVES 

Repeated  use  of  the  comparison  coefficient  has  been 
made  in  this  study.   The  present  chapter  is  devoted  to  an 
explanation  of  this  measure  of  association.   The  method 
should  be  flexible  enough  to  handle  multiple  comparisons 
simultaneously,  thus  enabling  one  to  measure  the  overall 
agreement  of  several  judges. 

It was decided that the best way to display the agreement
of two judges was through the use of a contingency table.
Table 8 is an example of one to be used in the discussion.

                  Judge X
               A     B     C
Judge Y   A    5     0     1
          B    1     3     3

Table 8

The contingency table indicates the agreement of the two
judges. The purpose of this chapter is to find a measure
which evaluates how close this agreement is. Note that
judge X categorized the observations into three clusters
with 6, 3, and 4 elements, respectively. Of the 13
observations, judge Y placed six in one group and seven in another.
The labeling of the clusters is arbitrary. The upper left
entry in the table indicates that five of the objects in
judge X's category A matched with five of judge Y's category
A. The entire table is interpreted in this manner. Notice
that if one chooses to call this entry of five as representing
agreement, then the entry of 1 below it and the 1 in the
top right corner must represent some of the observations
on which the judges disagreed.

The contingency table idea is easily generalized to
higher dimensions (more than two judges). In three
dimensions, a box (or cube) would represent the table, with
elements internal to the box measuring agreement between
three judges.

One method for measuring the degree of agreement is to
find the combination of entries with the largest sum such
that only one per row and one per column are chosen. This
task becomes very difficult as the number of clusters
increases, but it can be solved through the use of linear
programming techniques. It is a constrained optimization with
a linear objective function and is an application of the
"assignment problem." Unfortunately, when generalizing to
higher dimensions, the L.P. loses its unimodularity attribute
and the number of constraints and variables in the problem
becomes prohibitively large.
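
For a table as small as table 8 the assignment can be found by
simple enumeration, as the Python sketch below illustrates. It is
a brute-force stand-in for the linear programming formulation, not
the method used in this study; the function name best_assignment
and the orientation of the table (rows for judge Y, columns for
judge X, as read from the cluster sizes given in the text) are
assumptions of the sketch.

```python
from itertools import permutations


def best_assignment(table):
    """Return (total, pairs) for the one-per-row, one-per-column
    selection of entries with the largest sum.

    Brute force over all assignments; practical only for a few clusters.
    """
    rows, cols = len(table), len(table[0])
    if rows > cols:
        # Transpose so that every row can be given its own column.
        transposed = [list(col) for col in zip(*table)]
        total, pairs = best_assignment(transposed)
        return total, [(r, c) for c, r in pairs]
    best_total, best_pairs = -1, []
    for chosen in permutations(range(cols), rows):
        total = sum(table[r][c] for r, c in enumerate(chosen))
        if total > best_total:
            best_total, best_pairs = total, list(enumerate(chosen))
    return best_total, best_pairs


# Table 8: rows are judge Y's clusters, columns are judge X's clusters.
table_8 = [[5, 0, 1],
           [1, 3, 3]]

total, pairs = best_assignment(table_8)
print(total, pairs)
```

The number of candidate assignments grows factorially with the
number of clusters, which is why enumeration is not a practical
substitute for linear programming on larger tables.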

The Chi-square contingency statistic was considered
inappropriate because, when using the smaller sample sizes,
more than 20 percent of the cells have expected frequencies
of less than five (see ref. 14). Even when using the
190 element sample, there were frequent occasions when this
same difficulty persisted. The Chi-square statistic was
not used since it could not have been applied consistently
throughout the analysis.

Professor James Hartman provided an idea that led to
the method finally put into use.

B.   THE  TECHNIQUE 

The idea was to sum the squares of the entries in the
contingency table and then "normalize" this quantity.
Summing squares offers an excellent method for measuring
the degree of association; however, the following example
illustrates the need for some sort of adjustment factor.

      10     0               19     0
       0    10                0     1

         9a                     9b

TABLE 9

Both tables represent perfect agreement on 20 observations;
however, table 9a has a sum of squares equal to 200 and 9b
has a value of 362. It is desired to indicate both of
these examples as perfect agreement, with one being no
better than the other. Hence, it became necessary to
determine the "best possible" sum of squares in every given
situation. A computer program was written for this purpose
and is included in Appendix F. The statistic which is
used as a comparison coefficient is a number between zero
and one, formed as the ratio of the actual sum of squares
to the "best possible" sum of squares. The best possible
sum of squares is a computed sum using a minimax approach and
is based on the number of judges, number of clusters by
each judge, and the number of observations within each
cluster. The minimax procedure does not need to know which
observations make up a cluster, only how many observations.
An example showing the computation of the comparison
coefficient is given in Appendix E.
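
The need for the adjustment can be checked on the two examples of
table 9 with the Python sketch below. The sketch is an
illustration only and does not reproduce the minimax program of
Appendix F. It assumes, for two judges, that the "best possible"
sum of squares is the largest sum of squares over all tables
having the same cluster sizes, and it finds that value by
exhaustive enumeration; all function names are hypothetical.

```python
def sum_of_squares(table):
    """Sum of the squared entries of a two-way contingency table."""
    return sum(x * x for row in table for x in row)


def tables_with_margins(row_sums, col_sums):
    """Yield every non-negative integer table with the given margins."""
    if len(row_sums) == 1:
        # The last row must absorb whatever remains in each column.
        if sum(col_sums) == row_sums[0]:
            yield [list(col_sums)]
        return

    def first_rows(remaining, caps):
        # All ways to spread `remaining` over the columns within the caps.
        if len(caps) == 1:
            if remaining <= caps[0]:
                yield [remaining]
            return
        for x in range(min(remaining, caps[0]) + 1):
            for rest in first_rows(remaining - x, caps[1:]):
                yield [x] + rest

    for row in first_rows(row_sums[0], list(col_sums)):
        left = [c - x for c, x in zip(col_sums, row)]
        for below in tables_with_margins(row_sums[1:], left):
            yield [row] + below


def best_possible(row_sums, col_sums):
    """Largest sum of squares over all tables with these cluster sizes."""
    return max(sum_of_squares(t)
               for t in tables_with_margins(row_sums, col_sums))


def comparison_coefficient(table):
    """Ratio of the actual sum of squares to the assumed best possible."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    return sum_of_squares(table) / best_possible(row_sums, col_sums)


table_9a = [[10, 0], [0, 10]]
table_9b = [[19, 0], [0, 1]]
print(sum_of_squares(table_9a), comparison_coefficient(table_9a))
print(sum_of_squares(table_9b), comparison_coefficient(table_9b))
```

Under this assumption both tables receive a coefficient of one
even though their raw sums of squares differ, which is the
behavior desired of the normalized measure.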

This  method  for  measuring  the  degree  of  agreement 
provides  the  analyst  a  standard  scale  upon  which  to  compare 
coefficients  based  on  solutions  involving  varying  numbers 
of  clusters  and  cluster  memberships,  as  well  as  varying 
numbers  of  judges. 

In order to provide some feeling for the significance
of this measure, several cluster solutions were
formed wholly at random and compared to results produced
by MIKCA and the judges. In every case, the values of the
comparison coefficients were less than 0.1.

VIII.   SUMMARY AND CONCLUSIONS

This research has been largely exploratory. A path
has been paved for others to follow in examining the SOF
data. The theory of cluster analysis and its relationship
to discriminant theory have been carefully examined with
emphasis on two widely divergent techniques. In the analysis
of the data, attempts have been made to identify the
underlying structure of which the clusters are a consequence.
This chapter is devoted to separate discussions of the
cluster methodology explored and the interpretation of
the SOF data.

Although the methodology development phase of the
research was carried to completion in a general sense, a
number of problems were encountered along the way. Many
of these problems are deserving of deeper treatment and
are discussed below.

The data transformation chosen was the best of the three
considered. It produced the smallest test statistic for
homogeneity of covariances, but the value itself was not in
the acceptable range, based on normal theory. It should
be possible to improve the choice.

The modification of MIKCA to allow for weighting of
the input vectors has been effected and well tested. It
is an important added capability for this program.

The use of discriminant analysis to discover the
important variables affecting the clustering is, no doubt,
not new. It needs some refinement, however, because it is
not clear how one should rate the importance of variables
supporting the first dimension relative to those supporting
the second (or any other dimension). Such a set of priorities
could be most useful.

The  idea  of  using  the  important  variables  (and  their
signs)  of  the  discriminant  functions  in  the  problem  of
assigning  sets  of  variables  to  sets  of  features  is  believed
to  be  new.   It  may  have  great  potential  in  providing  a  way 
for  the  Chernoff  face  technique  to  replace  the  more  expensive 
technique  based  on  computer  iteration. 

The  present  attempt  to  work  with  the  faces  was  disappointing.
This  is  due  largely  to  the  fact  that  certain  restrictions,
truncations,  and  discontinuities  in  the  movement  of  the
features  were  not  well  documented  in  our  sources.   Their
discovery  came  as  a  surprise  late  in  the  research  and  it
was  not  feasible  to  go  back  and  readjust.   Such  readjustment
is  clearly  called  for  and  would  require  a  substantial  effort
in  future  studies.

The  coefficient  of  comparison  was  a  new  idea  and  there
was  insufficient  time  to  explore  its  properties.   More
investigation  is  needed  in  order  to  interpret  its
various  values  (or  to  find  another  measure  whose  values  are
interpretable).   The  comparison  measure  is  also  useful  in  the
problem  of  assigning  variables  to  features  when  working  with
faces.   The  goal  is  to  choose  assignments  having  the  property
that  the  judges  are  in  good  agreement  when  forming  the
clusters.

The  study  of  the  SOF  data  made  while  developing
the  cluster  methods  produced  results  about  student  evaluations
of  courses  (and  instructors).   The  results  are
discussed  below.

A  principal  components  analysis  of  the  data  swarm  of
mean  vectors  showed  it  to  be  essentially  one  dimensional,
lying  along  the  main  diagonal  of  13-space.   The
interpretation  of  this  is  that  all  13  items  are  equally
important  in  the  students'  perception  when  rating  the  course
and  its  instructor.   On  the  other  hand,  this  same  effect
would  be  produced  by  careless,  perfunctory  completion  of
the  forms  by  many  students.
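
The one-dimensional finding can be illustrated with a small numerical sketch. The Python below is not the thesis computation and does not use the SOF data. It generates synthetic 13-item mean vectors under an assumed model (one common "overall level" per course plus small independent noise per item), extracts the leading principal direction by power iteration on the sample covariance matrix, and measures how closely that direction lines up with the main diagonal of 13-space. All function names and the synthetic model are assumptions made for illustration only.

```python
import math
import random


def synthetic_courses(n_courses, n_items=13, seed=0):
    """Hypothetical data: each course has one overall level shared by all items."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_courses):
        level = rng.gauss(3.5, 0.6)  # assumed common rating level
        rows.append([level + rng.gauss(0.0, 0.1) for _ in range(n_items)])
    return rows


def leading_component(rows, iters=50):
    """Return (unit direction, share of total variance) of the first
    principal component, found by power iteration on the covariance matrix."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    v = [1.0] + [0.0] * (p - 1)  # arbitrary starting vector
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(p))
                     for a in range(p))
    total_variance = sum(cov[a][a] for a in range(p))
    return v, eigenvalue / total_variance


def cosine_with_diagonal(v):
    """Absolute cosine between a unit vector v and the direction (1, 1, ..., 1)."""
    return abs(sum(v)) / math.sqrt(len(v))


if __name__ == "__main__":
    direction, share = leading_component(synthetic_courses(200))
    print("share of variance on first component:", round(share, 3))
    print("cosine with main diagonal:", round(cosine_with_diagonal(direction), 4))
```

A cosine near 1 together with a large variance share is the pattern described above. The sketch cannot distinguish between the two explanations offered in the text (equal importance of all items versus perfunctory completion of the forms), since both produce the same geometry.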

The  partitioning  of  the  data  into  three  or  four  clusters
by  MIKCA  is  more  or  less  successful.   The  clusters  are  not
sharply  separated  (there  are  no  great  voids  between  them).
A  study  should  be  made  to  see  how  much  the  density  of  the
data  diminishes  near  the  boundaries  of  the  partitions.

Although  the  main  data  swarm  is  essentially  one  dimensional,
it  appears  useful  to  use  two  dimensions  to  describe
the  individual  partitions  after  clustering.   In  doing  this,
variable  12  (overall  rating  of  the  instructor)  emerged  as
most  important  in  the  first  dimension,  with  variables  3,  8,
and  11  giving  support  in  the  second.   Only  one  discriminant
study  is  reported  here,  although  several  were  performed.
Variable  12  appears  to  have  permanence  while  the  other
variables  often  shift  in  importance.

The  cluster  profiles,  which  track  the  cluster  centroids
over  the  13  variables,  provide  a  set  of  (almost)  horizontal
lines.   This  further  supports  the  one-dimensional  interpretation
of  the  data  swarm.   The  result  is  not  sensitive  to
whether  or  not  the  data  are  standardized.

The  results  of  applying  the  modified  MIKCA  did  not  vary
greatly  from  the  results  of  applying  the  original  MIKCA.
Hence  the  number  and  composition  of  the  clusters  are  not
disturbed  much  by  the  variability  in  class  size.

The  relative  position  of  the  clusters  is  strongly  and 
inversely  related  to  class  size.   The  courses  that  receive 
uniformly  high  ratings  are  associated  with  the  small  class 
sizes  and  the  courses  receiving  uniformly  poor  ratings  are 
associated  with  the  large  class  sizes. 

All  judges  reported  use  of  a  hierarchical  approach  to 
separate  the  faces  into  clusters.   Most  judges  first  separated 
the  faces  into  two  groups  according  to  the  curvature  of  the 
mouth  (smile  or  frown).   There  was  little  agreement  about
which  features  were  important  in  further  subdividing  the 
two  main  groups,  hence  some  disparity  resulted  in  their 
final  cluster  solutions. 

The  MIKCA  procedure  is  a  sophisticated  approach  to  cluster
analysis;  its  results  are  based  on  sound  statistical  theory.
The  modified  version  of  that  computer  program  is  considered
particularly  well  suited  to  the  SOF  data  or  any  other  data
set  possessing  the  same  predetermined  class  structure.   The
impact  of  class  size  on  cluster  membership  has  been  emphasized.
This  important  issue  may  indicate  that  the  smaller  classes
receive  artificially  inflated  SOF  scores.   Those  who  use  these
scores  as  a  means  for  evaluating  teacher  performance  must
surely  give  consideration  to  this  fact.


APPENDIX  A
Test  for  the  Equality  of  Dispersion  Matrices  of  k  Groups 
Given  a  sample  of  k  groups  and  m  variables  with  group
dispersion  matrices  $S_i$  (i = 1, ..., k),  pooled  within-groups
dispersion  matrix  $S_w$,  and  total  sample  observations
$N = \sum_{i=1}^{k} N_i$,  Box  shows  that  the  hypothesis

\[ H_0:\ \Sigma_1 = \Sigma_2 = \cdots = \Sigma_k \]

may  be  tested  by  an  F  statistic  developed  from

\[
\begin{aligned}
A &= (N-k)\,\ln|S_w| \;-\; \sum_{i=1}^{k} (N_i-1)\,\ln|S_i| \\[4pt]
B &= \left[\sum_{i=1}^{k}\frac{1}{N_i-1} \;-\; \frac{1}{N-k}\right]
     \frac{2m^2+3m-1}{6(k-1)(m+1)} \\[4pt]
C &= \left[\sum_{i=1}^{k}\frac{1}{(N_i-1)^2} \;-\; \frac{1}{(N-k)^2}\right]
     \frac{(m-1)(m+2)}{6(k-1)} \\[4pt]
D &= \frac{(k-1)\,m\,(m+1)}{2} \\[4pt]
E &= \frac{D+2}{\left|B^2 - C\right|}
\end{aligned}
\]

If  $B^2 > C$,  the  test  statistic  is

\[ \frac{E}{D}\left[\frac{A\,(1-B+2/E)}{E - A\,(1-B+2/E)}\right] \;\sim\; F^{D}_{E} \]

If  $C > B^2$,  the  test  statistic  is

\[ \frac{A}{D}\left(1 - B - \frac{D}{E}\right) \;\sim\; F^{D}_{E} \]
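
As a rough illustration of how the quantities A through E combine, the following Python sketch computes the F statistic and its degrees of freedom (D, E) for a list of groups, using the standard form of Box's F approximation. It is not the program used in this study. The function names are hypothetical, the determinant routine is a plain Gaussian elimination written to keep the example free of external libraries, and no p-value is computed (an F distribution routine would be needed for that). The degenerate case $B^2 = C$ is not handled.

```python
import math


def covariance(rows):
    """Sample dispersion (covariance) matrix of a list of observation rows."""
    n, p = len(rows), len(rows[0])
    mu = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[a] - mu[a]) * (r[b] - mu[b]) for r in rows) / (n - 1)
             for b in range(p)] for a in range(p)]


def det(mat):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in mat]
    n, d = len(a), 1.0
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(a[r][c]))
        if a[piv][c] == 0.0:
            return 0.0
        if piv != c:
            a[c], a[piv] = a[piv], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for j in range(c, n):
                a[r][j] -= f * a[c][j]
    return d


def box_test(groups):
    """Return (F, D, E) for k groups of m-variate observations."""
    k, m = len(groups), len(groups[0][0])
    sizes = [len(g) for g in groups]
    N = sum(sizes)
    covs = [covariance(g) for g in groups]
    # Pooled within-groups dispersion matrix S_w
    pooled = [[sum((n - 1) * S[a][b] for n, S in zip(sizes, covs)) / (N - k)
               for b in range(m)] for a in range(m)]
    A = (N - k) * math.log(det(pooled)) - sum(
        (n - 1) * math.log(det(S)) for n, S in zip(sizes, covs))
    B = (sum(1 / (n - 1) for n in sizes) - 1 / (N - k)) \
        * (2 * m * m + 3 * m - 1) / (6 * (k - 1) * (m + 1))
    C = (sum(1 / (n - 1) ** 2 for n in sizes) - 1 / (N - k) ** 2) \
        * (m - 1) * (m + 2) / (6 * (k - 1))
    D = (k - 1) * m * (m + 1) / 2
    E = (D + 2) / abs(B * B - C)
    if B * B > C:
        t = A * (1 - B + 2 / E)
        F = (E / D) * t / (E - t)
    else:
        F = (A / D) * (1 - B - D / E)
    return F, D, E
```

When every group has exactly the same dispersion matrix, A is zero and so is F; the statistic grows as the group dispersions diverge. The resulting F would be referred to an F distribution with D and E degrees of freedom.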
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APPENDIX  C 

The  following  information  is  taken  directly 
from  Chernoff's  technical  report  [9] 


Construction  of  Faces 

Given  18  numbers  $(x_1, x_2, \ldots, x_{18})$  in  appropriate  ranges
(which  will  usually  be  0  to  1),  we  define  a  face  (see
Fig.  3)  as  follows.   Let  H  be  a  nominal  distance  and  let
$h^* = \tfrac{1}{2}(1+x_1)H$  be  the  distance  from  the  origin  O  to  a  "corner"
point  P.   As  $x_1$  varies  from  0  to  1,  $h^*$  varies  from  H/2  to
H.   Let  $\theta^* = (2x_2-1)\pi/4$  be  the  angle  of  OP  with  the  horizontal.
Let  P'  be  a  point  symmetric  to  P  about  the  vertical
axis  through  O.   Let  $h = \tfrac{1}{2}(1+x_3)H$  represent  the  distance
from  O  to  U  the  top  of  the  head  and  L  the  bottom  of  the
head,  both  on  the  vertical  line  through  O.   The  upper  part
of  the  head  is  an  ellipse  which  is  determined  by  P',  U,
and  P  and  an  eccentricity  $x_4$.   Let  $x_4$  represent  the  ratio
of  the  width  to  height  of  the  upper  ellipse.   Similarly,
$x_5$  is  the  same  ratio  for  the  ellipse  through  P',  L,  and  P.
The  nose  is  a  vertical  line  of  length  $2hx_6$  with  O  as  center.
The  mouth  intersects  the  vertical  line  extended  through  the
nose  at  a  point  $P_m$  whose  distance  below  O  is  $h[x_7+(1-x_7)x_6]$.
This  represents  a  point  $x_7$  part  of  the  way  from  the  bottom
of  the  nose  to  L.   The  mouth  is  part  of  a  circle  whose  center
is  $h/x_8$  above  $P_m$.   Thus  a  positive  value  of  $x_8$  yields  a
smile.   The  mouth  is  symmetric  about  the  vertical  axis
through  O.   Its  projection  on  the  horizontal  axis  has
the  half-length  $a_m = x_9(h/|x_8|)$  unless  $(h/|x_8|)$  exceeds  the
half-width  $w_m$  of  the  face  at  the  height  of  $P_m$.   In  that
case  $x_9 w_m$  is  used.   The  eyes  are  located  at  a  height
$y_e = h[x_{10}+(1-x_{10})x_6]$  above  O  and  at  centers  which  are
$x_e = w_e(1+2x_{11})/4$  from  the  vertical  axis  where  $w_e$  is  the
half-width  of  the  face  at  the  height  $y_e$.   They  are  symmetrically
slanted  at  an  angle  $\theta = (2x_{12}-1)\pi/5$  with  the  horizontal.
The  eyes  are  ellipses  with  eccentricity  $x_{13}$  (height/length
before  slanting)  and  half-length  $L_e = x_{14}\min(x_e,\, w_e-x_e)$.

The  only  asymmetry  appears  in  the  location  of  the  pupils
which  move  together  an  amount  $r_e(2x_{15}-1)$  from  the  center
of  the  eye  where  $r_e = (\cos^2\theta + \sin^2\theta/x_{13}^2)^{-1/2} L_e$  is  the
horizontal  half-length  of  the  slanted  eye  at  height  $y_e$.

Finally  the  eyebrows  are  symmetrically  located  with
centers  at  a  height  $y_b = 2(x_{16}+.3)L_e x_{13}$  above  the  eye  centers
and  slant  $(2x_{17}-1)\pi/5$  with  respect  to  the  eye,  i.e.,
$\theta^{**} = \theta + (2x_{17}-1)\pi/5$  with  respect  to  the  horizontal  and
half-length  $L_b = r_e(2x_{18}+1)/2$.

One  final  step  taken  by  the  programmer,  and  which  has
been  left  intact,  is  to  normalize  both  horizontal  and
vertical  axes,  each  by  a  multiplicative  factor,  so  that
the  width  of  the  head  at  its  widest  part  and  its  height
are  both  equal  to  a  specified  constant.   This  step,  which
essentially  removes  two  degrees  of  freedom,  was  left
unaltered  for  intuitive  and  aesthetic  reasons  that  are
somewhat  vague  and  may  require  reconsideration  when  dealing
with  18-dimensional  data.   In  the  meantime,  the  effects
of  $x_1$  and  $x_3$  are  almost  but  not  completely  eliminated
because  of  the  secondary  effects  of  the  normalization,
which  will  adjust  all  of  the  other  features  at  the  same
time  as  the  width  and  height  are  normalized.

Most  of  the  parameters  $x_i$  are  adjusted  to  range  within
a  subinterval  of  (0,1).   The  exceptions  are  two  of  the
eccentricities,  $x_4$  and  $x_5$,  and  the  parameter  controlling
curvature  of  the  mouth,  $x_8$.   Ordinarily,  $x_4$  and  $x_5$  are  kept
within  1/2  to  2,  and  $x_8$  is  kept  within  (-5,5).   The  eccentricity
of  the  eye  $x_{13}$  has  usually  been  kept  within  (.4,.8).
Some  of  the  ranges  must  be  controlled  carefully.   We  do  not 
want  negative  length  eyes.   Others  need  not  be  so  carefully 
controlled.   It  is  no  calamity  to  have  eyes  extend  beyond 
the  face. 

When  the  two  ellipses  of  the  head  meet  smoothly,  the 
corner  point  P  is  lost,  and  the  variable  $x_2$  loses  effect.
Restricting  $x_4$  and  $x_5$  to  widely  separated  ranges  seems  to
avoid  this  problem. 

Data  are  converted  to  the  x  parameters  as  follows.   If
the  variable  Z  is  used  to  control  the  parameter  $x_i$,  which
is  to  be  allowed  to  range  from  $a_i$  to  $b_i$,  we  let

\[ x_i \;=\; a_i + (b_i - a_i)\,\frac{Z - m}{M - m} \]

where  m  and  M  are  the  observed  minimum  and  maximum  of  Z.
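
The conversion from a data variable Z to a face parameter is a linear rescaling of the observed range of Z onto the allowed interval. A minimal Python sketch follows; the function name `to_face_parameter` is hypothetical and the guard against a constant variable is an added assumption, since the text does not say how that case was handled.

```python
def to_face_parameter(values, a, b):
    """Map observed values of one variable Z linearly onto [a, b].

    The smallest observed value maps to a and the largest to b.
    """
    m, M = min(values), max(values)
    if M == m:
        # Assumption: a constant variable carries no information to display.
        raise ValueError("variable is constant; cannot rescale")
    return [a + (b - a) * (z - m) / (M - m) for z in values]


if __name__ == "__main__":
    ratings = [2.0, 4.0, 6.0]
    print(to_face_parameter(ratings, 0.0, 1.0))    # ordinary parameter range
    print(to_face_parameter(ratings, -5.0, 5.0))   # mouth curvature range
```

Because m and M are taken from the observed data, adding or removing observations can shift every face in the display, which is worth keeping in mind when comparing faces drawn from different subsets.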
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Formulae  Used  on  the  Construction 

We  describe  a  few  of  the  less  trivial  formulae  used
in  the  construction  of  the  faces.

The  point  P  has  coordinates  $x_0 = h^*\cos\theta^*$  and  $y_0 = h^*\sin\theta^*$.

The  ellipse  through  PUP'  has  equation

\[ \frac{x^2}{a_u^2} + \frac{(y - c_u)^2}{b_u^2} = 1 \]

where

\[ b_u = h - c_u, \qquad a_u = x_4\, b_u \]

and

\[ c_u = \frac{1}{2}\left[(h + y_0) - \frac{x_0^2}{x_4^2\,(h - y_0)}\right] \]

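The  ellipse  through  PLP'  has  equation

\[ \frac{x^2}{a_L^2} + \frac{(y - c_L)^2}{b_L^2} = 1 \]

where

\[ b_L = h + c_L, \qquad a_L = x_5\, b_L \]

and

\[ c_L = \frac{1}{2}\left[(-h + y_0) - \frac{x_0^2}{x_5^2\,(-h - y_0)}\right] \]

The  head  is  then  described  by  $(\pm x(y),\, y)$  where

\[
x(y) =
\begin{cases}
x_4\left[b_u^2 - (y - c_u)^2\right]^{1/2}, & y_0 \le y \le h \\[4pt]
x_5\left[b_L^2 - (y - c_L)^2\right]^{1/2}, & -h \le y \le y_0
\end{cases}
\]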


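The  mouth  is  a  circular  arc  with  curvature  $|x_8/h|$
through  $(0, y_m)$  where  $y_m = -h[x_7 + (1-x_7)x_6]$.   It  is
described  by

\[ y = y_m + (\operatorname{sgn} x_8)\left[\frac{h}{|x_8|} - \sqrt{\left(\frac{h}{x_8}\right)^2 - x^2}\,\right], \qquad 0 \le x \le a_m \]

where

\[ a_m = x_9 \min\left[x(y_m),\; h/|x_8|\right] \]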


The  eyes  are  nominally  centered  at  $(x_e, y_e)$  where

\[ y_e = h\left[x_{10} + (1 - x_{10})\,x_6\right] \]

\[ x_e = x(y_e)\,[1 + 2x_{11}]/4 \]

and  have  half-length

\[ L_e = x_{14} \min\left[x_e,\; x(y_e) - x_e\right] \]


Let  (u,v)  be  the  coordinates  of  an  ellipse  with  center  at
the  origin,  half-length  $L_e$  and  eccentricity  $x_{13}$.   Then
$v = x_{13}(L_e^2 - u^2)^{1/2}$  describes  part  of  the  ellipse.   A
similar  part  of  the  slanted  eye  can  be  described  for
$0 \le u \le L_e$  by

\[ x = x_e + u\cos\theta - v\sin\theta \]

\[ y = y_e + u\sin\theta + v\cos\theta \]

and  symmetry  is  used  to  complete  both  eyes.

To  place  the  pupils  within  the  eyes,  both  are  moved
a  distance  $r_e(2x_{15} - 1)$  from  the  center  of  the  eye,  where
$r_e$,  the  horizontal  half-length  of  the  slanted  eye  at
height  $y_e$,  is  $(u^2 + v^2)^{1/2}$  when  $v/u = \tan\theta$.   This  yields

\[ r_e = L_e\left[\cos^2\theta + x_{13}^{-2}\sin^2\theta\right]^{-1/2} \]

The  program  then  normalizes  all  heights  and  widths  by
multiplicative  factors  k/h  and  k/max x(y)  respectively.
Currently  k  is  set  at  2  inches.
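
To make the head construction concrete, the following Python sketch evaluates the corner point P and the half-width function x(y) of the head from the first five parameters. It is an illustration only, not the face-drawing program used in the study: the function name `head_outline`, the default H = 1, and the guard that rejects parameter combinations placing the corner above the top (or below the bottom) of the head are all assumptions.

```python
import math


def head_outline(x1, x2, x3, x4, x5, H=1.0):
    """Return (h, (x0, y0), half_width) for the two-ellipse head outline.

    half_width(y) gives x(y), the half-width of the head at height y,
    for -h <= y <= h.
    """
    h_star = 0.5 * (1.0 + x1) * H              # distance from O to corner P
    theta_star = (2.0 * x2 - 1.0) * math.pi / 4.0
    h = 0.5 * (1.0 + x3) * H                   # distance from O to U and to L
    x0 = h_star * math.cos(theta_star)
    y0 = h_star * math.sin(theta_star)
    if abs(y0) >= h:
        # Assumption: such inputs lie outside the restricted parameter ranges.
        raise ValueError("corner point P must lie strictly between L and U")
    # Centers of the upper and lower ellipses on the vertical axis
    c_u = 0.5 * ((h + y0) - x0 ** 2 / (x4 ** 2 * (h - y0)))
    c_l = 0.5 * ((-h + y0) - x0 ** 2 / (x5 ** 2 * (-h - y0)))
    b_u = h - c_u                              # vertical semi-axis, upper
    b_l = h + c_l                              # vertical semi-axis, lower

    def half_width(y):
        if y >= y0:
            return x4 * math.sqrt(max(b_u ** 2 - (y - c_u) ** 2, 0.0))
        return x5 * math.sqrt(max(b_l ** 2 - (y - c_l) ** 2, 0.0))

    return h, (x0, y0), half_width


if __name__ == "__main__":
    h, (x0, y0), width = head_outline(0.5, 0.5, 0.5, 0.8, 1.5)
    for y in (h, y0, -h):
        print(round(y, 3), round(width(y), 3))
```

Both ellipses pass through P by construction, so the half-width is continuous at height y0 and equals x0 there, and it falls to zero at the top U and the bottom L. The normalization step described above would then multiply heights by k/h and widths by k divided by the maximum of x(y).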
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APPENDIX  E 
An  Example  of  the  Comparison  Coefficient 

Given  two  judges  who  cluster  20  observations  (numbered 
1  through  20)  into  groups  as  shown  below: 


Judge  X
Cluster  1    1,2,3,4,7
Cluster  2    5,6,11,12,13
Cluster  3    8,9,10,14,15,16,17,18
Cluster  4    19,20

Judge  Y
Cluster  1    5,6,7
Cluster  2    1,2,3,4,9,15
Cluster  3    8,11,12
Cluster  4    10,13,14,16,17,18,19,20


The  contingency  table  appears  below  with  marginal  (row) 
totals. 


                          Judge  X
                   1     2     3     4     Total
Judge  Y    1      1     2     0     0       3
            2      4     0     2     0       6
            3      0     2     1     0       3
            4      0     1     5     2       8
                                            20


Step  1:   Find  the  sum  of  squares  of  the  table  entries


1+4+16+4+4+1+1+25+4   =   60

Step  2:   Find  best  possible  sum  of  squares.

List  the  number  of  observations  in  each  cluster  for  each  judge.
At  each  pass,  find  the  largest  remaining  entry  for  Judge  X  and
the  largest  remaining  entry  for  Judge  Y.   The  smaller  of  these
two  maxima  is  the  Minimax.   Subtract  the  Minimax  from  the  max
element  of  each  judge  and  repeat  until  all  entries  are  zero.

Pass    Judge X  # obs       Judge Y  # obs       Max X   Max Y   Minimax
        in clusters          in clusters
 1      5   5   8   2        3   6   3   8          8       8        8
 2      5   5   0   2        3   6   3   0          5       6        5
 3      0   5   0   2        3   1   3   0          5       3        3
 4      0   2   0   2        0   1   3   0          2       3        2
 5      0   0   0   2        0   1   1   0          2       1        1
 6      0   0   0   1        0   0   1   0          1       1        1

Finished  Step  2

Step  3:   Sum  the  squares  of  the  Minimax's


64+25+9+4+1+1   =   104


Best  possible  sum  of  squares   =   104


                               Actual  sum  of  squares
Comparison  coefficient   =   ------------------------------
                               Best  possible  sum  of  squares

                          =   60/104   =   0.58
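
The three steps of this example can be sketched in Python as follows. The function name `comparison_coefficient` and the data layout (one dictionary of cluster lists per judge) are assumptions made for illustration; the procedure itself follows the worked example above, with ties between equal maxima broken by taking the first one found.

```python
from collections import Counter


def comparison_coefficient(clusters_x, clusters_y):
    """Return (coefficient, actual, best) for two judges' clusterings.

    clusters_x, clusters_y: dict mapping cluster label -> list of observations.
    """
    label_x = {obs: c for c, members in clusters_x.items() for obs in members}
    label_y = {obs: c for c, members in clusters_y.items() for obs in members}
    # Step 1: sum of squares of the contingency table entries
    table = Counter((label_x[obs], label_y[obs]) for obs in label_x)
    actual = sum(n * n for n in table.values())
    # Step 2: repeatedly remove the smaller of the two largest cluster sizes
    sizes_x = [len(m) for m in clusters_x.values()]
    sizes_y = [len(m) for m in clusters_y.values()]
    minimaxes = []
    while any(sizes_x) and any(sizes_y):
        i = sizes_x.index(max(sizes_x))
        j = sizes_y.index(max(sizes_y))
        minimax = min(sizes_x[i], sizes_y[j])
        sizes_x[i] -= minimax
        sizes_y[j] -= minimax
        minimaxes.append(minimax)
    # Step 3: best possible sum of squares
    best = sum(v * v for v in minimaxes)
    return actual / best, actual, best


if __name__ == "__main__":
    judge_x = {1: [1, 2, 3, 4, 7], 2: [5, 6, 11, 12, 13],
               3: [8, 9, 10, 14, 15, 16, 17, 18], 4: [19, 20]}
    judge_y = {1: [5, 6, 7], 2: [1, 2, 3, 4, 9, 15], 3: [8, 11, 12],
               4: [10, 13, 14, 16, 17, 18, 19, 20]}
    coef, actual, best = comparison_coefficient(judge_x, judge_y)
    print(actual, best, round(coef, 2))
```

The sketch assumes both judges cluster the same set of observations, so the two lists of cluster sizes are exhausted at the same pass.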
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